
IDENTIFYING RISKY BANK LOAN USING PYTHON

A REPORT SUBMITTED
IN PARTIAL FULFILMENT FOR THE DEGREE OF

BACHELOR OF TECHNOLOGY
In
Computer Science and Engineering

By

ARITRA BHATTACHARJEE (201630100120004)


PALLAB MIDYA (201630100120023)
ISHITA DAS (039880)
PRITAM DEBNATH (201630100120006)

Pursued in
Department of Computer Science and Engineering

BENGAL INSTITUTE OF TECHNOLOGY AND MANAGEMENT

SANTINIKETAN, WEST BENGAL

To

MAULANA ABUL KALAM AZAD UNIVERSITY OF TECHNOLOGY


KOLKATA, WEST BENGAL
MAY, 2023
CERTIFICATE

This is to certify that the project report entitled Identifying Risky Bank Loan Using Python
submitted by Aritra Bhattacharjee, Pallab Midya, Ishita Das, and Pritam Debnath to the
Bengal Institute of Technology and Management, Santiniketan, in partial fulfilment for the
award of the degree of B.Tech in Computer Science and Engineering is a bonafide record
of project work carried out by them under my supervision. The contents of this report, in full
or in parts, have not been submitted to any other Institution or University for the award of
any degree or diploma.

Prof. Soumen Bhowmik


Supervisor
Dept. Of CSE, BITM

Prof. (Dr.) Subhasis Biswas Prof. Soumen Bhowmik


Director, BITM, Santiniketan HOD, Dept of CSE & BCA
May, 2023 BITM, Santiniketan

i
DECLARATION

I declare that this project report titled Identifying Risky Bank Loan Using Python
submitted in partial fulfilment of the degree of B. Tech in Computer Science and
Engineering is a record of original work carried out by us under the supervision of Prof.
Soumen Bhowmik and has not formed the basis for the award of any other degree or
diploma, in this or any other Institution or University. In keeping with the ethical
practice in reporting scientific information, due acknowledgements have been made
wherever the findings of others have been cited.

Aritra Bhattacharjee
University Roll No: 16300120051
University Registration No: 201630100120004

Pallab Midya
University Roll No:16300120032
University Registration No: 201630100120023

Ishita Das
University Roll No: 16300119023
University Registration No: 039880

Pritam Debnath
University Roll No: 16300120049
University Registration No: 201630100120006

Santiniketan-731236

ii
ACKNOWLEDGMENTS

This work would not have been possible without the constant support, guidance, and
assistance of our supervisor, Prof. Soumen Bhowmik; his patience, knowledge, and
ingenuity are something we will always keep aspiring to.

We take this opportunity to thank our guide and the HOD of our Department of
Computer Science and Engineering, Prof. Soumen Bhowmik of B.I.T.M. college, under
whose guidance we completed this project. A large share of our appreciation goes
to him.

I extend my sincere thanks to one and all of the BITM family for their help in the
completion of this project report.

I wish to express my thanks to all Teachers and Friends of the Department of Computer
Science and Engineering who were helpful in many ways for the completion of the
project.

Aritra Bhattacharjee
Pallab Midya
Ishita Das
Pritam Debnath

iii
ABSTRACT

Loans are no longer considered a last resort to buy a sought-after smartphone or a dream
house. Over the last decade or so, people have become less hesitant in applying for a
loan, whether personal, vehicle, education, business, or home, especially when they
don't have a lump sum at their disposal. Besides, home and education loans provide tax
advantages that reduce tax liability and increase the cash in hand from salary income.
To offer loans with minimal paperwork, quick eligibility checks, and competitive interest
rates, banks have opened online channels to apply and submit documents for the
approval process. Even if the loan application and review process seems intimidating,
the underlying logic is simple: credit history is indicative of future repayment behaviour,
based on the borrower's pattern in settling past loans, and it helps the bank judge whether
payments will be punctual and regular. Banks also weigh employment history and current
engagement to ensure that the applicant's source of income is reliable.

iv
TABLE OF CONTENTS

DESCRIPTION Page No.

CERTIFICATE i

DECLARATION ii

ACKNOWLEDGEMENTS iii

ABSTRACT iv

TABLE OF CONTENTS v

LIST OF FIGURES vii

ABBREVIATIONS/ NOTATIONS/ NOMENCLATURE viii

CHAPTER 1: INTRODUCTION 1

1.1 Motivation 2

1.2 Problem definition 2

1.2.1 Importing a working dataset 4

1.2.2 Understanding the big picture 4

1.2.3 Graphical Exploratory Data Analysis Techniques 7

1.2.4 Quantitative Exploratory Data Analysis Techniques 7

1.2.5 Exploratory Data Analysis Process 8

1.3 Objective of the project 9

CHAPTER 2: ANALYSIS 10

2.1 Proposed System 11

2.2 Software requirement specifications 14

2.3 Purpose 15

2.4 Scope 15

CHAPTER 3: WORKFLOW OF THE PROJECT 17

3.1 Data collection 18

3.2 Data cleaning and pre-processing 18

3.3 Exploratory data analysis 19

3.4 Feature selection 20

3.5 Model selection 21

3.6 Model training and Evaluation 24

v
3.7 Model deployment 25

CHAPTER 4: IMPLEMENTATION 26

4.1 Introductions to technologies used 27

4.1.1 Python 27

4.1.2 Google colab 28

4.2 Modules & Modules description 28

4.2.1 NumPy 28

4.2.2 Pandas 29

4.2.3 Seaborn 30

4.2.4 Sklearn 31

4.2.5 Train_test Split 32

4.2.6 SVM 32

4.2.7 Accuracy_Score 33

4.3 Sample code 33

SCREENSHOTS 36

CONCLUSION 43

REFERENCES 44

vi
LIST OF FIGURES

FIGURE TITLE PAGE NUMBER

1.1 Problem definition 3

1.2 Comparison between classical and 16

exploratory data analysis

1.3 Interquartile EDA technique 16

2.1 Proposed System Model 14

3.1 data pre processing 28

3.2 System architecture 22

3.3 data visualisation 29

3.4 flow chart 30

3.5 Support vector classifier 27

3.6 Flow chart of SVM model 28

4.1 Technologies used 33

4.2 train_test_split module 41

5.1 library files importing 46

5.2 dataset importing 46

5.3 head() implementation 46

5.4 describe() implementation 47

5.5 isnull() implementation 47

5.6 dropna() implementation 48

5.7 replace()&head() implement 48

5.8 comparison between education 49


and loan status column
5.9 comparison between marriage 49
and loan status column
5.2.1 replace function implementation 50

5.2.2 data separation 50

5.2.3 train_test_split module implementation 50

5.2.4 SVM implementation 51

vii
ABBREVIATIONS/ NOTATIONS/ NOMENCLATURE

The abbreviations used in this report are listed in alphabetical order below.

EDA Exploratory Data Analysis

ML Machine Learning

SVM Support Vector Machine

viii
CHAPTER 1

INTRODUCTION

1
1.1 Motivation:
We have discovered with time and experience that a large number of companies are looking
for insights and value that come from fundamentally descriptive activities. This means that
companies are often willing to allocate resources to acquire the necessary awareness of the
phenomenon that we analysts are going to study.
If we are able to investigate the data and ask the right questions, the EDA process becomes
extremely powerful. By combining these skills with data visualization, a skilled analyst can
build a career on them alone, without even moving into modelling. A good approach to EDA
therefore allows us to provide added value in many business contexts, especially where our
client or boss finds it difficult to interpret or access the data. This is the basic idea that led us
to put together the template presented here.

1.2 Problem definition


Exploratory Data Analysis (EDA), developed by John Tukey in the 1970s, is the first step in
the data analysis process. In statistics, exploratory data analysis is an approach to analyzing
data sets to summarize their main characteristics, often with visual methods. As the name
itself suggests, it is the step in which we explore the data set.
For example, suppose you are planning a trip to location "X". Before deciding, you explore
what that location has to offer, places to visit, waterfalls, trekking, beaches, and restaurants,
on Google, Instagram, Facebook, and other social websites.
Exploratory data analysis (EDA) is an especially important activity in the routine of a data
analyst or scientist.
It enables an in-depth understanding of the dataset, helps define or discard hypotheses, and
creates a solid basis for predictive models.
It uses data manipulation techniques and several statistical tools to describe and understand
the relationships between variables and how these can impact business.
In fact, it is thanks to EDA that we can ask ourselves meaningful questions that can impact
business.
In this chapter, we present a template for exploratory analysis that has proven to be solid for
many projects and domains. It is implemented using the Pandas library, an essential tool for
any analyst working with Python.

2
Figure 1.1: Problem definition (flow chart)

The process consists of several steps:


1. Importing a dataset
2. Understanding the big picture
3. Preparation
4. Understanding of variables
5. Study of the relationships between variables
6. Brainstorming
This template is the result of many iterations and allows me to ask myself meaningful
questions about the data in front of me. At the end of the process, we will be able to
consolidate a business report or continue with the data modeling phase.
Figure 1.1 above shows how the brainstorming phase is connected with that of understanding
the variables and how this in turn is connected again with the brainstorming phase.
This process describes how we can move to ask new questions until we are satisfied.

3
We will see some of the most common and important features of Pandas, along with some
techniques to manipulate the data in order to understand it thoroughly.

1.2.1 Importing a working dataset


The data analysis pipeline begins with the import or creation of a working dataset. The
exploratory analysis phase begins immediately after.
Importing a dataset is simple with Pandas through functions dedicated to reading the data. If
our dataset is a .csv file, we can just use
df = pd.read_csv("path/to/my/file.csv")
Here df stands for DataFrame, which is the Pandas object similar to an Excel sheet. This
nomenclature is often used in the field. The read_csv function takes as input the path of the
file we want to read. There are many other arguments that we can specify.
The .csv format is not the only one we can import — there are in fact many others such as
Excel, Parquet and Feather.
For ease, in this example we will use Sklearn to import the wine dataset. This dataset is
widely used in the industry for educational purposes and contains information on the
chemical composition of wines for a classification task. We will not use a .csv but a dataset
present in Sklearn to create the data frame.
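As a minimal sketch of this step (assuming Pandas and Scikit-learn are installed; the variable
names are only illustrative):

python

import pandas as pd
from sklearn.datasets import load_wine

# Load the wine dataset bundled with Scikit-learn (no .csv file needed)
wine = load_wine()

# Build a data frame from the feature matrix and add the class labels as a column
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df["target"] = wine.target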

1.2.2 Understanding the big picture


In this first phase, our goal is to understand what we are looking at, but without going into
detail. We try to understand the problem we want to solve, thinking about the entire dataset
and the meaning of the variables.
This phase can be slow and sometimes even boring, but it will give us the opportunity to make
an opinion of our dataset.
Let’s take some notes

I usually open Excel or create a text file to put some notes down, in this fashion:

1. Variable: name of the variable

2. Type: the type or format of the variable. This can be categorical, numeric,
Boolean, and so on

3. Context: useful information to understand the semantic space of the variable. In
the case of our dataset, the context is always the chemical-physical one, so it’s
easy. In another context, for example that of real estate, a variable could belong
to a particular segment, such as the anatomy of the material or the social one
(how many neighbours are there?)

4. Expectation: how relevant is this variable with respect to our task? We can use
a scale “High, Medium, Low”.

5. Comments: whether or not we have any comments to make on the variable


Of all these, Expectation is one of the most important because it helps us develop the analyst’s
“sixth sense” — as we accumulate experience in the field we will be able to mentally map
which variables are relevant and which are not.
In any case, the point of carrying out this activity is that it enables us to do some preliminary
reflections on our data, which helps us to start the analysis process.
Useful properties and functions in Pandas
We will leverage several Pandas features and properties to understand the big picture. Let’s
see some of them.
.head() and .tail()
Two of the most commonly used functions in Pandas are .head() and .tail(). These two allow
us to view an arbitrary number of rows (by default 5) from the beginning or end of the
dataset. Very useful for accessing a small part of the data frame quickly.
If we apply .shape on the dataset, Pandas returns us a pair of numbers that represent the
dimensionality of our dataset. This property is very useful for understanding the number of
columns and the length of the dataset.
There are also .dtypes and .isna() which respectively give us the data type info and whether
the value is null or not. However, using .info() allows us to access this information with a
single command.
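Continuing with the same data frame, a quick first look could be taken roughly as follows;
this is only a sketch of the calls just described.

python

print(df.head())        # first 5 rows
print(df.tail(3))       # last 3 rows
print(df.shape)         # (number of rows, number of columns)
print(df.dtypes)        # data type of each column
print(df.isna().sum())  # count of missing values per column
df.info()               # types, non-null counts and memory usage in one command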
Exploratory Data Analysis (EDA) is best described as an approach to find patterns, spot
anomalies or differences, and other features that best summarise the main characteristics of a
data set.
This approach involves the use of various EDA techniques, many of which include data
visualization methods, to glean insights into the data, validate the assumptions on which we
will base our future inferences, and even determine prudent models which define the data
with the minimum number of variables.
However, it is important to remember that Exploratory Data Analysis is not merely a set of
techniques, steps, or rules; rather, it is anything but. Quoting straight from the Engineering
Statistics Handbook, Exploratory Data Analysis is 'a philosophy' towards how the data is to
be analyzed.
And despite its apparent importance (after all, who wouldn't want more efficient models?), it
is more often than not the case that, because of this lack of rigid structure and its elusive
nature, EDA isn't used nearly as often as it should be.
Let’s methodically demystify the EDA concept, starting from the very basics of differences
between EDA vs data analysis and moving through to the exploratory data analysis steps and
techniques sprinkled with a few simple Python examples you can try yourself. And hopefully,
by the end of it, you’ll agree that EDA is not as intimidating as it is often made out to be
when working with data science and machine learning projects.
Why not call EDA just plain classical data analysis? If the answer to that question isn't
obvious already, let's just put it out there: because it isn't!
Since you've made it this far, thankfully, you won't be left having to take the above statement
with a pinch of salt. While you may have already vaguely sensed some of the differences
between classical data analysis and EDA, the following explanation should give you a clearer
picture.
EDA is indeed a data analysis approach, however, it differs starkly from the classical
approach in the very way it seeks to find a solution to a problem, or for that matter the way it
addresses one.
In the classical approach, the model is imposed on the data and the analysis and testing
follow. With EDA, on the other hand, the collected data set is first analyzed to infer what
model would be best suited for the data by investigating its underlying structure.

Figure: 1.2: Comparison between classical and exploratory data analysis

EDA is a data-focused approach – both in its structure and the models it suggests. On the flip
side, classical data analysis is aimed at generating predictions from models and is generally
quantitative in nature. Even the rigidity and formality that are prevalent in classical
techniques are absent in EDA. The two methods differ even by the way they deal with
information in that classical estimation techniques focus only on a few important
characteristics resulting in a loss of information whereas EDA techniques make almost no
assumption and often make use of all available data.

6
1.2.3 Graphical Exploratory Data Analysis Techniques
Some of the commonly used graphical techniques are:

Figure: 1.3: Interquartile EDA technique

Histogram:
Histograms can be used to summarize both continuous and discrete data. They help to
visualize the data distribution. They serve especially well to indicate gaps in the data and
even outliers.
Scatter plot:
Unlike the previous method, which is univariate (i.e. involving only a single variable), a
scatter plot reveals the relationship between two variables. The relationships reveal
themselves in the form of structures in the plot, such as lines or curves, that cannot simply be
explained as randomness.
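As an illustration (assuming the wine data frame from Section 1.2.1; the column names come
from that dataset), a histogram and a scatter plot could be drawn with Seaborn roughly as
follows:

python

import seaborn as sns
import matplotlib.pyplot as plt

# Histogram: distribution of a single continuous variable
sns.histplot(data=df, x="alcohol", bins=20)
plt.show()

# Scatter plot: relationship between two variables, coloured by class
sns.scatterplot(data=df, x="alcohol", y="color_intensity", hue="target")
plt.show()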

1.2.4 Quantitative Exploratory Data Analysis Techniques


Quantitative techniques are very similar to graphical techniques in the data they present and
vary only in the way they present their findings. They are, therefore, used less often not
because of any inferiority in performance but rather out of personal preference or convenience.
Some of the commonly used quantitative techniques are:
Variation:
Determining the variance or other related parameters of a data set describes the spread of the
data or how far the values are from the center. Every variable will have its own unique pattern
of variation and investigating this can often lead to interesting findings.
The Analysis of Variance (ANOVA) test is widely used in the testing of experimental data.
Hypothesis Testing:
A statement that is assumed to be true unless there is strong evidence contradicting it is called
a ‘statistical hypothesis’. These statements can be certain assumptions regarding the data set.
The process used to determine whether such a proposition is true is termed ‘hypothesis
testing’.

7
Hypothesis testing is accomplished in a series of steps. In this process, a null hypothesis,
which is initially assumed to be true, is replaced by an alternative hypothesis if the testing
results in the null hypothesis being rejected. This is done by comparing a quantitative
measure called the ‘test statistic’, which shows whether sample data is in agreement with the
null hypothesis, to a critical value to decide on the rejection of the null hypothesis.
The EDA techniques we have gone over in this section are by no means an exhaustive list of
techniques that can be used for accomplishing EDA. On venturing to use EDA and exploring
on your own, you are bound to discover other techniques and find that some work better for
you than others do.
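For concreteness, a short sketch of the variation and hypothesis-testing ideas above on the
same wine data frame is given below; the two-sample t-test uses SciPy, which is an additional
assumption since it is not among the libraries listed later in this report.

python

from scipy import stats

# Spread of each numeric variable
print(df.var())

# Hypothesis test: does the mean alcohol content differ between class 0 and class 1?
class0 = df.loc[df["target"] == 0, "alcohol"]
class1 = df.loc[df["target"] == 1, "alcohol"]
t_stat, p_value = stats.ttest_ind(class0, class1)
print("t-statistic:", t_stat, "p-value:", p_value)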

1.2.5 Exploratory Data Analysis Process


Now that we have gone over the techniques and understood their significance, let us move on
to the bigger picture. As you might have already guessed, the process of exploratory data
analysis isn’t what one might call ‘plain sailing’. Rather, it involves an iterative process of:
Questioning your Data:
EDA is a creative process and this lack of a strict set of rules both allows and necessitates that
you be curious. It is especially important during the initial phases of EDA that you explore
every avenue that occurs to you without being deterred by the fact that probably only a few of
them might ultimately lead to fruition.
Visualising, Analysing, and Modelling Data:
Once you have asked the questions, it is necessary that you appropriately visualize and
analyze the data in accordance with the questions in order to gain further insights into your
data.
Exploring New Avenues:
The observations you have made from the previous two steps will open up new avenues of
data exploration allowing you to refine your questions and even make more informed
inquiries.
As this process progresses you are bound to arrive at some particularly productive or
informative findings which will evolve into the results of your exploratory data analysis.
Exploratory Data Analysis:
Before we dive into this section, let's reiterate once more (if it hasn't been said enough
already) that there is no perfect way to go about EDA; there are just ways that work and ways
that don't. The steps listed below are, therefore, just one logical path you could explore while
you get started:

1. Importing and cleaning the imported data: The first step to any analysis is to
import the data, clean and transform it to the required readable format.
2. Univariate analysis: It is logical to start the analysis by considering one
variable at a time, learning each variable's distribution and summary statistics.
3. Pair exploration: The next step would be to identify relationships between
pairs of variables using simple two-dimensional graphs.

4. Multivariate analysis: Having analyzed the variables in pairs, the relationships
between larger groups can be analyzed to investigate and identify more
complex relationships.

5. Hypothesis testing and estimation: The assumptions made regarding the data
set can be tested at this stage and estimations are made regarding the
variability of variables.

6. Visualization: Visualisation is used both during exploration and analysis of the
data as well as to effectively communicate the results and findings.
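As a sketch of steps 3 and 4 above (again assuming the wine data frame used earlier), pairwise
relationships can be screened with a correlation matrix:

python

import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise correlations as a quick multivariate overview
corr = df.corr()
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.show()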

1.3 Objective of the project


Exploratory Data Analysis (EDA) involves using statistics and visualizations to analyze and
identify trends in data sets. The primary intent of EDA is to determine whether a predictive
model is a feasible analytical tool for business challenges or not. EDA helps data scientists
gain an understanding of the data set beyond the formal modeling or hypothesis testing task.
Exploratory Data Analysis is essential for any research analysis, as it provides insights into a
data set. In this chapter, let's take a look at the importance, purpose, and objectives of
Exploratory Data Analysis that an analyst would want to achieve with a data set.

Exploratory Data Analysis is essential for any business. It allows data scientists to analyze
the data before coming to any assumption. It ensures that the results produced are valid and
applicable to business outcomes and goals.

Importance of using EDA for analyzing data sets is:

 Helps identify errors in data sets.


 Gives a better understanding of the data set.
 Helps detect outliers or anomalous events.
 Helps understand data set variables and the relationship among them.

The primary objective of Exploratory Data Analysis is to uncover the underlying structure of
the data. The structure of the various data sets determines the trends, patterns, and
relationships among them. A business cannot come to a final conclusion or draw assumptions
from a huge quantity of data directly; rather, it requires taking an exhaustive look at the data
set through an analytical lens.

Therefore, performing an Exploratory Data Analysis allows data scientists to detect errors,
debunk assumptions, and much more to ultimately select an appropriate predictive model.

9
CHAPTER 2

ANALYSIS

10
2.1 Proposed system
To build a project on risky bank loans, we use several modern algorithms that help us
understand what types of risk can arise during any bank loan. The technologies and steps
that are mainly proposed in our project are shown below.

Figure: 2.1: Proposed System Model

Bank loan dataset:


The bank loan dataset contains information about bank loans and their characteristics. A
typical bank loan dataset may contain a variety of information, including:
1. Loan amount: This refers to the amount of money that the borrower has requested from
the bank.
2. Loan term: This is the length of time over which the loan will be repaid.
3. Interest rate: This is the percentage of the loan amount that the borrower will be
charged as interest.
4. Loan purpose: This refers to the reason why the borrower is requesting the loan, such
as to purchase a home or a car.
5. Borrower information: This includes data about the borrower, such as their income,
credit score, employment history, and other relevant factors that may impact their
ability to repay the loan.
6. Collateral: This refers to any assets that the borrower pledges as security for the loan,
such as a home or a car.
7. Loan status: This indicates whether the loan is still outstanding or has been repaid.
8. Default status: This indicates whether the borrower has failed to make payments on the
loan, resulting in default.

11
With this type of dataset, you could explore various questions related to bank loans, such as
what factors are associated with loan default, which loan characteristics are most strongly
correlated with interest rates, or how loan amounts vary by borrower demographics.
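As a purely illustrative sketch, such questions could be explored with Pandas as below; the
file name and column names here are hypothetical placeholders, not the actual dataset used in
this project.

python

import pandas as pd

# Hypothetical file and column names, for illustration only
loans = pd.read_csv("bank_loan.csv")

# Default rate broken down by loan purpose
print(loans.groupby("loan_purpose")["default_status"].mean())

# Relationship between loan amount and interest rate
print(loans[["loan_amount", "interest_rate"]].corr())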
Data processing:
Bank loan Data processing typically involves the collection, analysis, and management of
information related to loan applications and approvals. This process may include several
steps, such as verifying the borrower's creditworthiness, assessing the risk of lending money,
and determining the terms and conditions of the loan.
Here are some of the key steps involved in bank loan data processing:
1. Application submission: Borrowers submit loan applications, either in person or
online, providing details about their financial situation, employment, and credit
history.
2. Credit check: The bank checks the borrower's credit score and credit history to
determine their creditworthiness. This involves reviewing the borrower's payment
history, outstanding debts, and other financial information.
3. Income verification: The bank verifies the borrower's income, employment status,
and other financial information to assess their ability to repay the loan.
4. Loan assessment: Based on the borrower's creditworthiness and ability to repay the
loan, the bank assesses the risk of lending money and determines the terms and
conditions of the loan.
5. Loan approval or rejection: The bank either approves or rejects the loan application
based on the borrower's creditworthiness, ability to repay the loan, and other factors.
6. Loan disbursal: If the loan is approved, the bank disburses the funds to the borrower.
7. Loan management: The bank manages the loan account, including collecting
payments, tracking the borrower's payment history, and handling any issues that
arise.
Throughout this process, the bank may use various software applications to help automate
and streamline the loan processing workflow, such as loan origination systems, credit scoring
models, and loan servicing platforms. Additionally, data privacy and security are important
considerations in bank loan data processing, as banks must protect borrower information and
comply with applicable regulations.
Training Dataset:
A training dataset in the context of bank loans would typically include information about past
loan applications, including details such as the applicant's credit history, income, employment
status, loan amount requested, and whether the loan was approved or denied.
The purpose of using a training dataset in this context would be to build a machine learning
model that can predict the likelihood of a new loan application being approved or denied
based on the applicant's characteristics and loan details. The model would be trained on the
historical loan data to identify patterns and relationships between these variables and loan

approval outcomes, and would then use these patterns to make predictions on new loan
applications.
It is important to note that the accuracy and effectiveness of the model would depend heavily
on the quality and relevance of the training dataset used. The dataset should be representative
of the population being considered for loans and should be large enough to capture a diverse
range of loan scenarios and outcomes. Additionally, the dataset should be properly labelled
and cleaned to ensure that the model is not biased by inaccurate or incomplete data.
Test Dataset:
A test dataset in a risky bank loan scenario would typically contain a set of data points
representing loan applications that have been approved or denied based on certain criteria.
The dataset may include a variety of features such as the applicant's credit score, income
level, employment status, debt-to-income ratio, loan amount, loan term, and other relevant
information.
The purpose of the test dataset is to evaluate the effectiveness of a machine learning model or
algorithm in predicting the likelihood of loan default or delinquency based on the available
features. The model may be trained on a separate dataset and then tested on this new dataset
to determine its accuracy and generalizability.
It is important to ensure that the test dataset is representative of the population of loan
applications that the model is intended to be applied to. The dataset should also be carefully
curated to prevent any biases or inconsistencies that could affect the accuracy of the model's
predictions.
Model:
The process of making a decision about whether or not to approve a risky bank loan typically
involves several steps and considerations. Here is a general overview of the model-making
process:
Assessment of the borrower's creditworthiness:
The first step is to evaluate the borrower's creditworthiness by examining their credit history,
income, debt-to-income ratio, employment status, and other relevant financial information.
This assessment helps determine the likelihood of the borrower being able to repay the loan.
Analysis of the loan terms:
The bank will analyze the proposed loan terms, such as the interest rate, repayment period,
and collateral requirements, to determine whether they are appropriate for the level of risk
associated with the borrower.
Evaluation of the purpose of the loan:
The bank will consider the purpose of the loan, such as whether it is for a business investment
or a personal expense, to determine the level of risk associated with the loan.
Examination of the borrower's financial stability:
The bank will assess the stability of the borrower's financial situation, including their job
security, savings, and investment portfolio, to determine the likelihood of them being able to
repay the loan.

13
Review of industry and market conditions:
The bank will also consider industry and market conditions that may impact the borrower's
ability to repay the loan, such as changes in interest rates, economic conditions, and
competition.
Risk analysis and decision-making:
Based on the information gathered during the previous steps, the bank will perform a risk
analysis to determine the level of risk associated with the loan. Based on this analysis, the
bank will make a decision about whether or not to approve the loan, and if so, under what
terms and conditions.
Overall, the model-making process for risky bank loans involves a thorough evaluation of
various factors that can impact the borrower's ability to repay the loan. The goal is to balance
the level of risk associated with the loan with the potential benefits of approving it, such as
generating revenue for the bank and helping the borrower achieve their financial goals.

2.2 Software requirement specifications


The purpose of this Software Requirements Specification (SRS) document is to define the
requirements for the development of a software system that uses the Support Vector Machine
(SVM) algorithm to evaluate the risk of bank loans. The system will be referred to as the
Risky Bank Loan (RBL) system. This document outlines the functional and non-functional
requirements for the RBL system.
Functional Requirements:
a) Data Pre-processing: The RBL system should be able to pre-process the data collected
to prepare it for SVM algorithm analysis. This includes data cleaning, data
normalization, and feature selection.
b) SVM Algorithm Implementation: The RBL system should use the SVM algorithm to
analyze the pre-processed data to determine the creditworthiness of the applicant. The
SVM algorithm should be able to handle both linear and non-linear data.
c) Risk Assessment: The RBL system should provide a risk assessment report for each
loan application based on the analysis of the SVM algorithm. The report should
include the probability of default and other risk indicators.
d) User Interface: The RBL system should have a user interface that allows bank staff to
input applicant data, view risk assessment reports, and manage loan applications.
Non-functional Requirements:
f) Performance: The RBL system should be able to process loan applications in a timely
manner. The system should be able to handle a large volume of loan applications
without compromising its performance.

g) Security: The RBL system should be designed with security in mind to protect
confidential customer information. The system should ensure that user data is
protected against unauthorized access, theft, and tampering.

h) Reliability: The RBL system should be reliable and available 24/7. The system should
be able to handle unexpected errors and ensure data integrity.

i) Scalability: The RBL system should be designed to accommodate future growth. The
system should be scalable to handle an increasing number of loan applications without
affecting its performance.

j) Usability: The RBL system should be easy to use for bank staff with minimal training.
The system should have a user-friendly interface that is intuitive and easy to navigate.

Conclusion:

The Risky Bank Loan system is a software system that uses the SVM algorithm to evaluate
the risk of bank loans. This document outlined the functional and non-functional
requirements for the system. These requirements should serve as a guide for the development
team to ensure the successful implementation of the system.

2.3 Purpose

The purpose of developing a risky bank loan system using SVM (Support Vector Machine)
algorithm is to help banks and financial institutions make more informed decisions when
granting loans to their clients.
The SVM algorithm is a powerful machine learning technique that can effectively identify
patterns and relationships within large datasets, making it well-suited for analyzing complex
financial data. By using SVM, the system can accurately predict the likelihood of a borrower
defaulting on their loan, based on a range of factors such as credit history, income,
employment status, and other relevant data.
With this system in place, banks can better assess the risk associated with lending money to a
particular borrower and make more informed decisions about whether to grant or deny a loan.
This can help to minimize the risk of loan defaults and improve the overall financial health of
the institution.
Furthermore, this system can also help to reduce the time and resources required for manual
loan analysis, by automating the loan evaluation process. This can lead to faster loan
processing times, increased efficiency, and a more streamlined lending process.
Overall, the purpose of developing a risky bank loan system using SVM algorithm is to
enable banks and financial institutions to make more informed, data-driven lending decisions,
and to minimize the risk of loan defaults and financial losses.

2.4 Scope
Introduction:
Risk management is a critical component of any financial institution, especially banks. To
prevent losses and ensure profitability, banks rely on risk assessment techniques to evaluate
and mitigate risk. One such technique is the use of machine learning algorithms, such as
Support Vector Machines (SVMs), to identify high-risk loans. The aim of this section is to
describe the scope of building a risky bank loan system using the SVM algorithm.
This section provides a comprehensive overview of the SVM algorithm, its application in
risk assessment, and the scope of using the algorithm to identify high-risk loans.

15
Background:
The banking industry is characterized by high levels of risk due to the nature of its operations.
Banks lend money to individuals and businesses, and these loans have to be repaid with
interest. However, not all borrowers are able to repay their loans, which can result in losses
for the bank. To minimize these losses, banks use risk assessment techniques to evaluate the
creditworthiness of borrowers and the potential risks associated with lending to them.
Support Vector Machines (SVMs) are a type of machine learning algorithm that can be used
for classification and regression analysis. The SVM algorithm works by identifying a
hyperplane that separates the data into different classes. The algorithm then finds the
maximum margin hyperplane, which is the hyperplane that is furthest from the data points of
both classes. SVMs are widely used in risk assessment and fraud detection, and they have
been shown to be effective in identifying high-risk loans.
Scope of the Project:
The scope of the project is to develop a risky bank loan system that uses the SVM algorithm
to identify high-risk loans. The system will be designed to evaluate loan applications based
on various parameters, including the borrower's credit history, income, and debt-to-income
ratio. The system will then assign a risk score to each loan application, indicating the
likelihood of the loan defaulting.
The SVM algorithm will be used to train the system on a dataset of historical loan data. The
dataset will include information about the borrower, the loan amount, the loan term, the
interest rate, and other relevant parameters. The system will use this data to identify patterns
and relationships that can be used to predict the likelihood of a loan defaulting.
The system will be designed to be user-friendly and easy to use. Bank employees will be able
to input loan application data into the system, and the system will generate a risk score for
each loan application. The system will also provide recommendations on whether to approve
or reject a loan application based on the risk score.
Challenges and Limitations:
One of the main challenges of developing a risky bank loan system using the SVM algorithm
is the availability of high-quality data. The system will require a large dataset of historical
loan data to train the SVM algorithm effectively. However, obtaining high-quality data can be
challenging, especially for smaller banks that may not have access to large datasets.
Another challenge is the complexity of the SVM algorithm. The SVM algorithm can be
difficult to understand and implement, especially for non-technical users. The system will
need to be designed in a way that is easy to use and understand for bank employees.
The accuracy of the system is another limitation. The accuracy of the system will depend on
the quality and quantity of data used to train the SVM algorithm. If the system is not trained
on a sufficiently large and diverse dataset, it may not be accurate in identifying high-risk
loans.

16
CHAPTER 3

WORKFLOW OF THE PROJECT

17
3.1 Data collection
The data set collected for prediction is split into a training set and a test set. Generally, a 7:3
ratio is applied to split the data into the training set and the test set. The data model created
using machine learning algorithms is fitted on the training set, and based on the resulting
accuracy, predictions are made on the test set.
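A minimal sketch of this 7:3 split with Scikit-learn, assuming a feature matrix X and a label
vector y have already been prepared:

python

from sklearn.model_selection import train_test_split

# X holds the cleaned features and y the labels (prepared earlier, not shown here)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)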

3.2 Data Pre-processing


Validation techniques in machine learning are used to estimate the error rate of the Machine
Learning (ML) model, which can be considered close to the true error rate on the dataset. If
the data volume is large enough to be representative of the population, you may not need
validation techniques. However, in real-world scenarios we often work with samples of data
that may not be truly representative of the population of the given dataset, so we need to find
missing values and duplicate values and describe each data type, whether it is a float variable
or an integer. The validation set is a sample of data used to provide an unbiased evaluation of
a model fit on the training dataset while tuning model hyper parameters. The evaluation
becomes more biased as skill on the validation dataset is incorporated into the model
configuration. The validation set is used for frequent evaluation of a given model, and
machine learning engineers use this data to fine-tune the model hyper parameters. Data
collection, data analysis, and the process of addressing data content, quality, and structure can
add up to a time-consuming to-do list. During the process of data identification, it helps to
understand your data and its properties; this knowledge will help you choose which algorithm
to use to build your model.
We perform a number of different data cleaning tasks using Python's Pandas library,
focusing on probably the biggest data cleaning task: missing values. This lets us clean data
more quickly, so we spend less time cleaning and more time exploring and modelling.
Some missing values are just simple random mistakes; other times, there can be a deeper
reason why data is missing. It is important to understand these different types of missing data
from a statistics point of view, because the type of missing data will influence how to fill in
the missing values, how to detect them, and whether a basic imputation or a more detailed
statistical approach is appropriate. Before jumping into code, it is important to understand
the sources of missing data. Here are some typical reasons why data is missing:
• The user forgot to fill in a field.
• Data was lost while being transferred manually from a legacy database.
• There was a programming error.
• Users chose not to fill out a field tied to their beliefs about how the results would be used
or interpreted.
Variable identification with Uni-variate, Bi-variate and Multi-variate analysis:
1. Import libraries for access and functional purpose & read the given dataset.
2. General Properties of analyzing the given dataset.
3. Display the given dataset in the form of data frame.
4. Show columns.
5. Shape of the data frame.
6. To describe the data frame.
7. Checking data type and information about dataset.

8. Checking for duplicate data.
9. Checking missing values of the data frame.
10. Checking unique values of data frame.
11. Checking count values of data frame.
12. Rename and drop the given data frame.
13. To specify the type of values.
14. To create extra columns.
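A condensed sketch of several of these checks in Pandas, where df again stands for the
working data frame:

python

print(df.columns)             # show the columns
print(df.shape)               # shape of the data frame
print(df.describe())          # summary statistics
df.info()                     # data types and non-null counts
print(df.duplicated().sum())  # number of duplicate rows
print(df.isnull().sum())      # missing values per column
print(df.nunique())           # unique values per column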

Figure 3.1: Data pre-processing (Import Packages → Read Data → Data Pre-processing)

Figure: 3.2: System Architecture

3.3 Data Cleaning


We import the library packages and load the given dataset, then analyze variable
identification by data shape and data type and evaluate the missing and duplicate values. A
validation dataset is a sample of data held back from training your model that is used to give
an estimate of model skill while tuning the model; there are procedures you can use to make
the best use of validation and test datasets when evaluating your models. Data cleaning and
preparation is done by renaming columns of the given dataset, dropping columns, and so on,
in order to analyze the uni-variate, bi-variate and multi-variate processes.

19
The steps and techniques for data cleaning will vary from dataset to dataset. The primary goal
of data cleaning is to detect and remove errors and anomalies to increase the value of data in
analytics and decision making.
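For instance, typical cleaning actions of this kind (dropping missing values and duplicates,
renaming and dropping columns, replacing values) might look like the sketch below; the
column names and replacement values are hypothetical.

python

df = df.dropna()                     # drop rows with missing values
df = df.drop_duplicates()            # remove duplicate rows
df = df.rename(columns={"Loan_Status": "loan_status"})  # rename a column (hypothetical name)
df = df.drop(columns=["Loan_ID"])    # drop an identifier column (hypothetical name)
df["loan_status"] = df["loan_status"].replace({"Y": 1, "N": 0})  # encode labels (hypothetical values)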

3.4 Exploratory data analysis


Data visualization is an important skill in applied statistics and machine learning. Statistics
does indeed focus on quantitative descriptions and estimations of data, but data visualization
provides an important suite of tools for gaining a qualitative understanding. This can be
helpful when exploring and getting to know a dataset and can help with identifying patterns,
corrupt data, outliers, and much more. With a little domain knowledge, data visualizations
can be used to express and demonstrate key relationships in plots and charts that are more
visceral to stakeholders than measures of association or significance. Data visualization and
exploratory data analysis are whole fields in themselves, and a deeper dive into some of the
books on the subject is recommended. Sometimes data does not make sense until it is looked
at in a visual form, such as charts and plots. Being able to quickly visualize data samples is
an important skill both in applied statistics and in applied machine learning. In this section
we cover the types of plots you will need to know when visualizing data in Python and how
to use them to better understand your own data:
1. How to chart time series data with line plots and categorical quantities with bar charts.
2. How to summarize data distributions with histograms and box plots.

Figure 3.3: Data visualization (Import Packages → Read Data → Data Visualization)
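A brief sketch of those plot types with Seaborn and Matplotlib, assuming a working data
frame df; the column names are hypothetical placeholders.

python

import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(data=df, x="education")        # bar chart of a categorical column (hypothetical name)
plt.show()

sns.histplot(data=df, x="applicant_income")  # histogram of a numeric column (hypothetical name)
plt.show()

sns.boxplot(data=df, x="loan_status", y="applicant_income")  # box plot by class (hypothetical names)
plt.show()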

20
Figure 3.4: Flow chart of the project (Data Collection → Data Pre-Processing → Data Cleaning → Model Selection → Model Training & Evaluation → Model Deployment)

3.5 Model Selection


A bank is an institution that accumulates funds from the community in the form of savings
and distributes them back in the form of credit or other products, in order to improve the
living standard of the people. A person who wants credit must submit an application to the
bank. The regulation for accepting or rejecting a credit proposal considers five requirements;
one of them is character, the customer's commitment to fully repay the debt within the agreed
term. Credit risk arises if the customer is not able to pay his obligation within that term,
which is called bad credit. Credit risk analysis can be done by applying several methods; one
of them is machine learning.

Machine learning is a combination of computer science, engineering, and statistics, and so it
is also considered part of statistics. Machine learning mainly builds prediction rules that
generalize to new data. The machine learning approach is relatively simpler than the
traditional approach (classical statistics) for dealing with complex and big data, so it is
suitable for the banking sector, which mainly processes large amounts of customer data that
would take a lot of time to handle with traditional methods. One example of a popular
machine learning algorithm is the Support Vector Machine (SVM). It is a classification
technique that uses the most optimal hyperplane. SVM has been applied to problems in gene
analysis, finance, medicine, and also banking.

21
Some research has been done using the SVM algorithm to classify potential customers of
loan companies as able to pay their debts or not. However, the evaluation of the models used
is often limited to accuracy and error values. Therefore, the authors are interested in using
other evaluation measures as well, such as the AUC value. This research analyzes credit risk
using a machine learning method, namely the SVM algorithm, to classify prospective
customers into a good credit or bad credit class. The classification process of the dataset was
done using the Python programming language. The data were taken randomly from 2015 to
2018 at Bank XX, as many as 610 records. The dataset was divided into two parts, with 80%
training data and 20% testing data. The variables used were gender, plafond, rate, term of
time, job, income, face amount, warranty, and loan history as independent variables, and
credit status as the dependent variable. This research was done in three stages: data
pre-processing, model building, and model testing. The model was built using the training
data and then checked using the testing data to discover its performance. Model evaluation
was computed using the confusion matrix, which gives accuracy, sensitivity, specificity,
precision, F1-score, false positive rate, false negative rate, and AUC (Area Under the Curve)
values.

Support Vector Machines (SVM) is a popular supervised machine learning algorithm used for
classification and regression tasks. It works by finding the hyperplane in a high-dimensional
space that maximally separates the data points into different classes. In Python, you can
implement SVM using the Scikit-learn library, which provides a user-friendly interface for
training and testing SVM models.
Here's an example of how to use SVM for classification in Python:
First, you need to import the necessary libraries:

python

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

Next, you can load a sample dataset, such as the iris dataset, using the `load_iris()` function
from Scikit-learn:

python

iris = datasets.load_iris()
X = iris.data
y = iris.target

You can then split the dataset into training and testing sets using the `train_test_split()`
function:

python

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

22
Next, you can create an instance of the SVM classifier by calling the `SVC()` function:

python

svm = SVC(kernel='linear')

You can then train the classifier on the training data using the `fit()` method:
python

svm.fit(X_train, y_train)

Finally, you can make predictions on the test data using the `predict()` method and calculate
the accuracy of the model using the `accuracy_score()` function:

python

y_pred = svm.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

print('Accuracy:', accuracy)

This is just a simple example of using SVM for classification in Python. There are many
different hyper parameters you can tune in SVM to optimize the model's performance, such
as the choice of kernel, regularization parameter, and gamma value.
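For example, one common way to search over those hyper parameters is a cross-validated
grid search; the sketch below reuses X_train and y_train from the split above, and the
parameter values are illustrative only, not the configuration used in this project.

python

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Candidate hyper parameter values to try (illustrative only)
param_grid = {
    "kernel": ["linear", "rbf"],
    "C": [0.1, 1, 10],
    "gamma": ["scale", 0.01, 0.1],
}

# 5-fold cross-validated grid search over the SVM hyper parameters
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)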

Figure 3.5: Support Vector Classifier

23
Figure 3.6: Flow Chart of SVM Model

3.6 Model Training and Evaluation


To train and evaluate an SVM algorithm in Python, you can follow these steps:
1. Load the dataset: You can load the dataset using Scikit-learn's `load_` or `fetch_`
functions, or you can read the data from a file.
2. Split the dataset: Split the dataset into training and testing sets using Scikit-learn's
`train_test_split()` function.
3. Pre-process the data: Perform any necessary pre-processing steps such as scaling,
normalization, or feature selection.
4. Create an SVM model: Create an instance of the SVM model with the appropriate hyper
parameters.
5. Train the model: Train the SVM model using the training data with the `fit()` method.
6. Evaluate the model: Evaluate the performance of the trained SVM model on the testing
data using various metrics such as accuracy, precision, recall, F1-score, and ROC-AUC.

24
We first load the Iris dataset and then split the data into training and testing sets with an 80:20
ratio. We create an SVM model with a linear kernel and regularization parameter `C=1` and
train the model using the training data. Finally, we evaluate the performance of the trained
SVM model on the testing data by computing the accuracy and classification report.
Note that in practice, you may need to tune the hyper parameters of the SVM model to obtain
the best performance on the given dataset. This can be done using cross-validation or grid
search techniques.
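A minimal sketch of the steps just described, using the iris data as in the earlier example; the
80:20 split, linear kernel and C=1 match the text, everything else is illustrative.

python

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Load the data and split it 80:20
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Linear-kernel SVM with C=1, as described above
model = SVC(kernel="linear", C=1)
model.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))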

3.7 Model deployment


Once you have trained an SVM model in Python, you can deploy it for making predictions on
new data. Here are some steps for deploying an SVM model in Python:
Save the trained model: After training an SVM model, you can save it to a file using the
`joblib` library. It provides functions to save and load models in a binary format (in older
Scikit-learn versions it was also available as `sklearn.externals.joblib`, which has since been
removed).

python

import joblib

# Save the model to a file
joblib.dump(svm, 'svm_model.pkl')

Load the saved model: To use the trained SVM model for making predictions, you need to
load the model from the file using the `joblib` module. For example:

python

# Load the saved model from file

svm = joblib.load('svm_model.pkl')

Prepare new data: To make predictions on new data, you need to pre-process the data in the
same way as the training data. For example, you may need to scale the data or perform feature selection.
Make predictions: Once you have loaded the trained SVM model and pre-processed the new
data, you can use the `predict()` method to make predictions on the new data.
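Putting those steps together, a deployment-time sketch might look as follows; the saved file
name follows the earlier example, while the new data values are hypothetical.

python

import joblib
import numpy as np

# Load the model saved earlier
svm = joblib.load("svm_model.pkl")

# One new sample, pre-processed the same way as the training data
# (the feature values below are hypothetical)
new_data = np.array([[5.1, 3.5, 1.4, 0.2]])

prediction = svm.predict(new_data)
print("Predicted class:", prediction[0])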

25
CHAPTER 4
IMPLEMENTATION

26
4.1 Introduction to technologies used
In our project the technologies we have used are:

Figure 4.1: Technologies used (Python, Google Colab, NumPy, Pandas, Seaborn, Sklearn)

First of all, we had to choose a programming language well suited to our project
requirements: one that is flexible, easy to write and read, and has rich libraries available. So
we decided to use the Python programming language.
4.1.1 Python
Python is a high-level programming language that has become increasingly popular in recent
years. Created in the late 1980s by Guido van Rossum, Python is known for its simplicity,
readability, and flexibility. It is an interpreted language, which means that the code can be
executed without being compiled beforehand, making it ideal for rapid prototyping and
development.
Python's popularity stems from its ability to be used in a wide variety of applications,
including web development, scientific computing, artificial intelligence, data analysis, and
automation. It also has a vast array of libraries and frameworks that make development faster
and more efficient. Python's syntax is designed to be simple and easy to understand, making
it accessible to new programmers and allowing them to focus on solving problems rather than
worrying about the intricacies of the language.

27
Python's versatility and ease of use have made it a favourite among developers, students and
businesses alike. It is widely used by tech giants such as Google, Facebook, and Amazon, as
well as small start-ups and individual developers. The language's community is also active
and vibrant, with a large number of resources available online, including tutorials,
documentation, and support forums.
In summary, Python is a powerful and flexible language that is easy to learn and use. Its
widespread adoption and active community make it an excellent choice for a wide range of
applications, from simple scripts to complex software systems.
We use the Python programming language in our project because of its rich libraries and its
flexibility: thanks to the predefined functions and classes in Python libraries like NumPy,
Pandas and sklearn, we can process and implement our project quickly and successfully.
After deciding on the programming language, it was time for us to decide on the integrated
development environment (IDE) for our project. Since we chose Python, we had a vast
number of IDEs to choose from, such as Visual Studio Code, PyCharm and Spyder, but we
wanted a platform that is cloud based, so we can easily share, store and present our project
without having to carry it around physically. We found a platform named Google
Colaboratory, which allows us to write Python code online and stores it in our cloud storage;
it also provides us with CPU and RAM resources online.
4.1.2 Google Colab
Google Colab is a cloud-based platform that provides an interactive environment for writing
and running Python code. It is a free service offered by Google that allows users to access
powerful computing resources, including GPUs and TPUs, from anywhere with an internet
connection. With Google Colab, you can create and share documents called notebooks, which
contain code, text, images, and visualizations. These notebooks are stored on Google Drive
and can be easily shared and collaborated on with others. Google Colab also includes a wide
range of pre-installed libraries and tools, making it an excellent platform for data analysis,
machine learning, and artificial intelligence. With its easy-to-use interface and powerful
features, Google Colab is an ideal tool for developers, researchers, and students alike.
4.2 Modules & Modules Description
One of the main reasons we chose the Python programming language for this project is its
vast collection of libraries/modules, which can be used in fields such as data analysis, AI and
scientific computation. In our project, the specific applications we needed these libraries for
were data analysis, data integration, data visualisation and machine learning algorithm
implementation. We therefore chose our Python libraries according to those specific needs:
we used the "NumPy" and "pandas" libraries for data integration and data analysis, the
"seaborn" library for data visualisation, and finally the "sklearn" library for the machine
learning/algorithmic implementation in our project.
Now let's get to know each of these modules a little better.

4.2.1 NumPy
NumPy is a Python library that is commonly used for numerical computations and data
analysis. It offers a high-performance multidimensional array object and a wide variety of

tools for working with these arrays. Its popularity stems from its ability to efficiently perform
mathematical operations on entire arrays of data, which can often be much faster than
working with individual elements. NumPy also comes equipped with several built-in
functions for mathematical operations, including linear algebra, Fourier transforms, and
random number generation.
The main focus of NumPy is its ndarray (n-dimensional array) object, which is a
multidimensional container that holds data of the same type. This allows for fast
mathematical computations and operations to be carried out on entire arrays of data, which is
beneficial in scientific and engineering applications as well as data analysis and machine
learning. NumPy is open source and freely available, and it can be installed using a package
manager such as pip. It is also compatible with many other Python libraries, including
pandas.
In our project we had to load and handle the dataset of the bank customers and perform
analysis on it, so that we could later modify the data as needed. For these reasons we chose the
NumPy library: it converts our dataset into an N-dimensional array and provides many useful
predefined functions and classes that operate on that array far more efficiently than applying
individual operations to each element of the dataset.

To use NumPy in Python, you need to import the library first. You can install NumPy using
pip by running the following command in your terminal:
pip install numpy
After installation, you can import NumPy in your Google Colab notebook using the following
line of code:
import numpy as np
This line of code imports the NumPy library and gives it the alias "np" for convenience.
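As a brief illustration (a minimal sketch with made-up numbers, not the project's data), the ndarray object lets us apply a single operation to a whole column of values at once:

import numpy as np

# toy applicant incomes and loan amounts; illustrative values only
income = np.array([5849, 4583, 3000, 2583, 6000])
loan_amount = np.array([128, 66, 120, 141, 267])

ratio = loan_amount / income          # element-wise division, no explicit loop
print(ratio.mean(), ratio.max())      # summary statistics computed on the whole array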
4.2.2 Pandas
We used the Pandas library in our data analysis project because Pandas is a popular open-source
library for data manipulation and analysis in Python. It provides high-level data structures
and functions designed to make working with structured data (which we created earlier with
NumPy) fast and easy.
The two main data structures in Pandas are Series and Data Frame. A Series is a one-
dimensional array-like object that can hold any data type, while a Data Frame is a two-
dimensional tabular data structure with labelled axes (rows and columns) that can also hold
any data type.
Pandas offers a wide range of functions for data manipulation, including filtering, sorting,
merging, grouping, and aggregation which are some exact operations that we needed for our
project.
Pandas can handle various data formats such as CSV (the file type we used as the dataset in
our project), Excel, SQL databases, and more. It is widely used in industries such as finance,
economics, social sciences, and engineering, as well as in data science and machine learning
applications.

To use Pandas in Python, you need to import the library first. You can install Pandas using
pip by running the following command in your terminal:
pip install pandas
After installation, you can import Pandas in your Google Colab notebook using the following
line of code:
import pandas as pd
This line of code imports the Pandas library and gives it the alias "pd" for convenience.
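For illustration, here is a minimal sketch of the kind of Pandas operations we relied on; the tiny DataFrame below is made up (in the project the data comes from a CSV file read with pd.read_csv), but the column names mirror those in our dataset:

import pandas as pd

# a tiny illustrative DataFrame; the real project loads the full loan dataset from CSV
df = pd.DataFrame({
    'Education':   ['Graduate', 'Not Graduate', 'Graduate', 'Graduate'],
    'LoanAmount':  [128.0, 66.0, None, 141.0],
    'Loan_Status': ['Y', 'N', 'Y', 'N'],
})

print(df.shape)                                        # number of rows and columns
print(df.head())                                       # first rows of the DataFrame
print(df.isnull().sum())                               # missing values per column
print(df.dropna().shape)                               # rows left after dropping missing values
print(df.groupby('Education')['LoanAmount'].mean())    # a simple aggregation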
4.2.3 Seaborn
In our EDA project we used Seaborn as our data visualization library. Seaborn is built on top
of Matplotlib, another Python visualisation library, which means you can still use Matplotlib
functions when working with Seaborn. The Seaborn library provides a high-level interface for
creating informative and visually appealing statistical graphics, and it makes common data
visualisation operations much easier to carry out.
Seaborn is particularly useful for exploring and understanding data through visualizations. It
offers a wide range of statistical plots, such as scatter plots, line plots, bar plots, histograms,
box plots, violin plots, and many others. These plots are designed to showcase relationships,
distributions, and patterns in your data.
Here are some key features and benefits of using Seaborn:
1. High-level interface: Seaborn provides a simple and intuitive API for creating complex
statistical visualizations with minimal code.
2. Attractive default styles: Seaborn comes with pleasant and visually appealing default
styles, making your visualizations look professional without much customization.
3. Statistical enhancements: Seaborn extends the capabilities of Matplotlib by adding
additional statistical features to the plots. For example, it can automatically compute and
display confidence intervals or fit regression models to your data.
4. Easy integration with Pandas: Seaborn works seamlessly with Pandas data structures,
making it easy to visualize and explore datasets stored in DataFrames.
5. Categorical data support: Seaborn includes functions that handle categorical data
effectively, allowing you to create informative visualizations based on different categories.
To use Seaborn, you first need to install it. You can install Seaborn using pip by running the
following command in your terminal:
pip install seaborn
After installation, you can import Seaborn in your Google Colab notebook using the following
line of code:
import seaborn as sns
By convention, Seaborn is often imported with the alias "sns" for brevity. Once imported, you
can start using Seaborn's functions to create visually appealing and informative plots that help
explore and communicate your data effectively.
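For instance, a minimal sketch with a tiny made-up DataFrame (the real project plots the full loan dataset) shows how a single countplot() call compares loan outcomes across a category:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# tiny illustrative DataFrame with the same column names as our dataset
df = pd.DataFrame({
    'Education':   ['Graduate', 'Graduate', 'Not Graduate', 'Graduate', 'Not Graduate'],
    'Loan_Status': ['Y', 'Y', 'N', 'N', 'N'],
})

# bar counts of approved ('Y') and rejected ('N') loans for each education level
sns.countplot(x='Education', hue='Loan_Status', data=df)
plt.show()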
4.2.4 Sklearn
Sklearn is a popular open-source library for machine learning in Python. It is built on top of
NumPy, SciPy, and Matplotlib and provides a wide range of tools for supervised and
unsupervised machine learning tasks such as classification, regression, clustering,
dimensionality reduction, and model selection.
It was a natural choice for us to use the sklearn library to implement machine learning in our
project because of its close integration with the libraries we had already used (NumPy, and
Seaborn, which is built on Matplotlib). Since sklearn is itself built on NumPy, SciPy and
Matplotlib, choosing it as our machine learning library fitted our existing stack better than any
alternative.
Here are some key features and benefits of using Sklearn:
1. Simple and consistent API: Sklearn provides a simple and consistent API that makes it
easy to use for beginners and experts alike.
2. Wide range of algorithms: Sklearn includes a wide range of state-of-the-art machine
learning algorithms, from simple linear regression to advanced ensemble methods like
random forests and gradient boosting.
3. Integration with other Python libraries: Sklearn integrates well with other Python libraries
like Pandas and NumPy, making it easy to handle and manipulate data.
4. Model selection and evaluation: Sklearn provides tools for model selection and evaluation,
such as cross-validation and grid search, which helps you choose the best model for your
data.
5. Active development and community: Sklearn is under active development and has a large
community of contributors and users who provide support, documentation, and examples.
To use Sklearn, you first need to install it. You can install Sklearn using pip by running the
following command in your terminal:
pip install scikit-learn
After installation, you can import Sklearn in your Google Colab notebook using the following
line of code:
import sklearn
Sklearn is a large library with many modules and functions. To use a specific module or
function, you need to import it explicitly. For example, we used the train_test_split function
from sklearn.model_selection, which can be imported as follows:
from sklearn.model_selection import train_test_split
With Sklearn, we can start building and training machine learning models quickly and
efficiently, even if we are beginners.

4.2.5 Train_test_split
Now, about the modules under Sklearn. The `train_test_split` module is a function provided
by the scikit-learn (sklearn) library, which is widely used for machine learning tasks in
Python. This function allows you to split a dataset into training and testing subsets, which is
crucial for evaluating and validating the performance of machine learning models.
The `train_test_split` function takes in one or more arrays or matrices as input, representing
the features and labels of your dataset. It randomly shuffles and splits the data into two or
more portions based on the specified test size or train size. The typical usage splits the data
into a training set and a testing set, but you can also use it for more advanced scenarios like
cross-validation.
Here is the syntax of the `train_test_split` call in our code:

Figure 4.2: train_test_split module

In the example above, `X` represents the input features, `y` represents the corresponding
labels or target values, and `test_size=0.1` indicates that 10% of the data will be allocated for
testing, while the remaining 90% will be used for training. The `random_state` parameter is
optional and allows you to set a seed value for random shuffling, ensuring reproducibility of
the same train-test split.
The `train_test_split` function returns four subsets: `X_train` (training features), `X_test`
(testing features), `y_train` (training labels), and `y_test` (testing labels). These subsets can
be used for training and evaluating machine learning models.
It's important to split our data into separate training and testing sets to evaluate how well our
model generalizes to unseen data. The training set is used to train the model, while the testing
set is used to assess its performance. This helps to identify potential issues like overfitting,
where the model performs well on the training data but poorly on unseen data.
By using `train_test_split`, we can conveniently split our dataset into training and testing
subsets and proceed with building, training, and evaluating our machine learning model with
confidence.
4.2.6 SVM
Support Vector Machines (SVM) is a supervised machine learning algorithm used for both
classification and regression tasks. SVM aims to find an optimal hyperplane in a high-
dimensional feature space that separates different classes or predicts continuous values.
The key idea behind SVM is to maximize the margin between the decision boundary
(hyperplane) and the closest data points from each class. These closest data points are called
support vectors, hence the name "Support Vector Machines." By maximizing the margin,
SVM seeks to achieve better generalization and robustness against noise in the data.
SVM has several advantages:

1. Effective in high-dimensional spaces: SVM performs well even in cases where the number
of features is greater than the number of samples. It is suitable for tasks with a large number
of features.
2. Versatility: SVM supports various kernel functions, such as linear, polynomial, radial basis
function (RBF), and sigmoid, allowing flexibility in capturing complex relationships in the
data.
3. Robust against overfitting: By maximizing the margin, SVM tends to generalize well and
be less prone to overfitting. It works particularly well in cases with clear margin separation.
However, SVM also has some considerations:
1. Computational complexity: SVM can be computationally expensive, especially for large
datasets, as it requires solving a quadratic programming problem.
2. Sensitivity to hyperparameters: The performance of SVM can be sensitive to the choice of
hyperparameters, such as the kernel type and C-parameter. Proper tuning is important for
optimal results.
Sklearn is a popular Python library that provides an SVM implementation with the `SVC`
class for classification tasks and the `SVR` class for regression tasks. These classes offer
various configuration options and allow us to train SVM models on our datasets.
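A minimal sketch of training a classifier with the `SVC` class is given below; the toy 2-D points, the linear kernel and the C value are all illustrative choices, not the project's loan data or settings:

import numpy as np
from sklearn.svm import SVC

# toy two-dimensional points belonging to two classes
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

# a linear kernel looks for the widest margin between the classes;
# C controls the trade-off between margin width and training errors
classifier = SVC(kernel='linear', C=1.0)
classifier.fit(X, y)

print(classifier.support_vectors_)       # the points that define the margin
print(classifier.predict([[4, 4]]))      # class of a new, unseen point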
4.2.7 Accuracy_score
The `accuracy_score` function is a utility provided by scikit-learn (sklearn), a popular Python
library for machine learning. It is used to calculate the accuracy of a classification model by
comparing the predicted labels with the true labels.
The `accuracy_score` function takes two arguments, `y_true` and `y_pred`:
- `y_true` represents the true labels or target values of the dataset.
- `y_pred` represents the predicted labels obtained from a classification model.
The `accuracy_score` function compares the corresponding elements in `y_true` and `y_pred`
and calculates the accuracy of the predictions. It returns a single floating-point number
representing the accuracy score, which indicates the percentage of correct predictions.
The `accuracy_score` function can handle multiclass classification as well, where it computes
the accuracy by considering all classes. It supports both 1D arrays (for binary classification)
and 2D arrays (for multiclass classification).
The `accuracy_score` function is a useful tool for evaluating the performance of classification
models. It provides a simple and intuitive metric to measure the accuracy of predictions,
making it easier to assess the model's effectiveness.
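As a quick illustration (toy labels, not the project's actual predictions):

from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 1]      # actual loan statuses
y_pred = [1, 0, 1, 0, 0, 1]      # labels predicted by some classifier

# fraction of predictions that match the true labels (5 out of 6 here)
print(accuracy_score(y_true, y_pred))    # 0.8333...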

4.3 Sample code


Importing the Dependencies/modules
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score

Data Collection and Processing


# loading the dataset to pandas DataFrame
loan_dataset = pd.read_csv('/train_u6lujuX_CVtuZ9i (1).csv')
type(loan_dataset)

Exploratory Data Analysis (EDA)


# printing the first 5 rows of the dataframe
loan_dataset.head()
# number of rows and columns
loan_dataset.shape
# statistical measures
loan_dataset.describe()
# number of missing values in each column
loan_dataset.isnull().sum()
# dropping the missing values
loan_dataset = loan_dataset.dropna()
# number of missing values in each column
loan_dataset.isnull().sum()
loan_dataset.head()
# label encoding
loan_dataset.replace({"Loan_Status":{'N':0,'Y':1}},inplace=True)
# printing the first 5 rows of the dataframe
loan_dataset.head()
# Dependent column values
loan_dataset['Dependents'].value_counts()
# replacing the value of 3+ to 4
loan_dataset = loan_dataset.replace(to_replace='3+', value=4)
# dependent values

loan_dataset['Dependents'].value_counts()
Feature selection/Data Visualization
# education& Loan Status
sns.countplot(x='Education',hue='Loan_Status',data=loan_dataset)
# marital status & Loan Status
sns.countplot(x='Married',hue='Loan_Status',data=loan_dataset)
Model Selection
# convert categorical columns to numerical values
loan_dataset.replace({'Married':{'No':0,'Yes':1},
                      'Gender':{'Male':1,'Female':0},
                      'Self_Employed':{'No':0,'Yes':1},
                      'Property_Area':{'Rural':0,'Semiurban':1,'Urban':2},
                      'Education':{'Graduate':1,'Not Graduate':0}}, inplace=True)
loan_dataset.head()
# separating the data and label
X = loan_dataset.drop(columns=['Loan_ID','Loan_Status'],axis=1)
Y = loan_dataset['Loan_Status']
(The Python code given above is only a sample of our project code, included up to the model
selection part; the machine learning part that follows it is excluded from the sample and is
sketched below.)
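For completeness, here is a sketch of that excluded machine-learning part. It follows the steps described in the module sections above (train_test_split, SVM, accuracy_score) and continues from the X and Y variables prepared in the sample code; the linear kernel and the exact parameter values are illustrative assumptions, not a verbatim copy of our notebook.

# splitting the prepared data into training and testing subsets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=2)

# training a Support Vector Machine classifier (linear kernel assumed here)
classifier = svm.SVC(kernel='linear')
classifier.fit(X_train, Y_train)

# accuracy on the training data
training_accuracy = accuracy_score(Y_train, classifier.predict(X_train))
print('Accuracy on training data:', training_accuracy)

# accuracy on the testing data
test_accuracy = accuracy_score(Y_test, classifier.predict(X_test))
print('Accuracy on test data:', test_accuracy)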

SCREENSHOTS

Importing the Dependencies/modules:

Figure 5.1: library files importing

Here in the above code we are importing the necessary library files for our project, which we
have already discussed previously (sections 4.1 & 4.2).
Data collection and processing:

Figure 5.2: dataset importing

Here we are loading the .CSV file, i.e. the dataset that we are going to work on, initializing it
as "loan_dataset" and also printing its type.
Exploratory data analysis:

Figure 5.3: head() implementation

Here we are showing the first five rows of our dataset by using the "head()" function on
"loan_dataset". Then, by using "loan_dataset.shape", we can see how many rows and columns
there are in our dataset; in our case it is 614 rows and 13 columns.

Figure 5.4: describe() implementation

Here we are using the "describe()" function to evaluate the count, mean, std, min, max,
25%, 50% and 75% values of applicant income, loan amount, credit history, etc.

Figure 5.5: isnull() implementation

Here we are evaluating the number of null entries in our dataset by applying the "isnull()" and
"sum()" functions to it: wherever our dataset has a missing or unrecognisable value, we find
and count it to see how many null values are present in total. As the figure above shows, in
our dataset the "Gender" column has a total of 13 missing or null values.

Figure 5.6: dropna() implementation

Here, by using the "dropna()" function, we are dropping (deleting) the rows with null values
that were previously present in our dataset. Then, after executing "dropna()", if we use the
"isnull()" and "sum()" functions again we can clearly see that there are no more null values
present in our dataset.

Figure 5.7: replace() & head() implementation

Here we are using the "replace()" function. With this function we can replace an existing
value in a particular column of our dataset with a value of our choosing. We have to pass the
arguments of "replace()" as follows: first we give the name of the column we want to operate
on, then we open "{" and put the existing value, a ":", and then the replacement value
(repeating this if there is more than one replacement), then close with "}", pass "inplace=True"
to actually change the values, and lastly close the function parenthesis ")". After that we only
have to run the line; a small example of this call pattern is shown below.

Feature selection and data visualization:

Figure 5.8: comparison between the education and loan status columns

Here we are using the Seaborn library's "countplot()" function, which plots a graphical
representation of the values in our dataset. We are comparing the values of the "Education"
and "Loan_Status" columns; the blue bars represent rejected loans and the orange bars
represent approved loans. As we can clearly see from the graphical comparison, applicants
who hold graduate degrees have far more approved loans than non-graduates.

Figure 5.9: comparison between the married and loan status columns

Here we are again using the Seaborn library's "countplot()" function to plot a graphical
representation of the values in our dataset. This time we are comparing the values of the
"Married" and "Loan_Status" columns; the blue bars represent rejected loans and the orange
bars represent approved loans. As we can clearly see from the graphical comparison,
applicants who are married have far more approved loans than unmarried applicants.

Figure 5.2.1: replace function implementation

Here, by using the same "replace()" function as before, we are converting all of the text values
in our dataset into integer values so that we can use as much of the available data as possible
in the train and test process.

Figure 5.2.2: data separation

Here we are using the "drop()" function to drop two columns from our dataset: the "Loan_ID"
column, which we do not need, and the "Loan_Status" column, which we store separately. We
now have the dataset without the "Loan_ID" and "Loan_Status" columns in the variable "X",
and the "Loan_Status" column in the variable "Y". With the features and labels separated, we
can implement the machine learning model (SVM), which learns a decision boundary from the
labelled feature values; by then applying that model to a test dataset we can obtain an accuracy
score showing how well our model predicts the loan status of the applicants.

Figure 5.2.3: train_test_split module implementation

Here we are splitting the data into a training set and a testing set (described earlier under
section 4.2.5, the train_test_split module).

Figure 5.2.4: SVM implementation

Here we are implementing the Support Vector Machine (SVM) model, then computing and
printing the accuracy score on the training dataset, followed by the accuracy score on the
testing dataset. In our case the prediction accuracy is 0.83, i.e. above 80 percent. We can also
see that the prediction scores on the training and testing datasets are very close, so we can say
with reasonable confidence that there is no overfitting problem in our model.

CONCLUSION

In this project the analytical process started with data cleaning and processing, handling of
missing values and exploratory analysis, and ended with model building and evaluation. The
model achieving the best accuracy score on the public test set is the one we retain. This
application can help with risky bank loan customer identification.
In this project we have tried to describe and implement the process of data analysis and AI
prediction as well as we can. We went through every step, from importing the library files and
the dataset, through data integration and data analysis, all the way to machine learning model
selection, algorithm implementation and prediction, and described each step as we understand
it. This model can also be applied to other problems; for example, it could be used by
customers to find a suitable bank of their choice by analysing and comparing data from
different banks, provided such data is available. In this way the model can be used in various
scenarios depending on the needs of the user.

REFERENCES

[1] Amruta S. Aphale, Sandeep R. Shinde, "Predict Loan Approval in Banking System: Machine Learning
Approach for Cooperative Banks Loan Approval," International Journal of Engineering Research &
Technology (IJERT), Volume 09, Issue 08, August 2020.

[2] Ashwini S. Kadam, Shraddha R. Nikam, Ankita A. Aher, Gayatri V. Shelke, Amar S. Chandgude,
"Prediction for Loan Approval using Machine Learning Algorithm," International Research Journal of
Engineering and Technology (IRJET), Volume 08, Issue 04, April 2021.

[3] M. A. Sheikh, A. K. Goel and T. Kumar, "An Approach for Prediction of Loan Approval using Machine
Learning Algorithm," 2020 International Conference on Electronics and Sustainable Communication Systems
(ICESC), 2020, pp. 490-494, doi: 10.1109/ICESC48915.2020.9155614.

[4] Golak Rath, Debasish Das, Biswaranjan Acharya, "Modern Approach for Loan Sanctioning in Banks Using
Machine Learning," 2021, pp. 179-188, doi: 10.1007/978-981-15-5243-4_15.

[5] Vincenzo Moscato, Antonio Picariello, Giancarlo Sperlí, "A benchmark of machine learning approaches for
credit score prediction," Expert Systems with Applications, Volume 165, 2021, 113986, ISSN 0957-4174.

[6] Yash Divate, Prashant Rana, Pratik Chavan, "Loan Approval Prediction Using Machine Learning,"
International Research Journal of Engineering and Technology (IRJET), Volume 08, Issue 05, May 2021.
