
Vehicle Loan Default Detection

Akhil Sai Peddireddy, Akhila, Sanatkumar, Goutham
Charlottesville, USA

ABSTRACT
The increased vehicle loan rejection rates at financial institutions are a direct result of the rise in vehicle loan defaults, which causes significant losses for the institutions. Our goal is to identify the clients capable of repayment so that their loans are not rejected. The financial institution also benefits from knowing which clients are likely to default on a vehicle loan. To address this issue, we implement a set of data mining algorithms that classify each loanee as a defaulter or a non-defaulter.

This project uses the "LT Vehicle Loan Default Prediction" data set from Kaggle [1]. The data set provides various information about the loan and the loanee, such as loanee information (demographic data like age, identity proof, etc.), loan information (disbursal details, loan-to-value ratio, etc.) and bureau data history (bureau score, number of active accounts, status of other loans, credit history, etc.). On this data set, we performed classification using K-Nearest Neighbors (KNN) [2], Support Vector Machine (SVM) [3], Naive Bayes, Decision Trees, Random Forest [10], XGBoost [8] and Logistic Regression. The data set contains ground-truth labels, so we can determine which classification algorithms perform better on this type of data. We evaluate the models using accuracy, recall and precision.

INTRODUCTION
An increase in vehicle loan defaults leads to significant losses for financial institutions, which in turn drives up loan rejection rates. To identify the clients capable of repayment so that their loans are not rejected, we use data mining techniques to improve the efficiency of the vehicle loan approval process. This has major advantages, such as increasing customer satisfaction and reducing bad loans.

The data set being used comes from a financial institution named LT: the "LT Vehicle Loan Default Prediction" data set from Kaggle. It has about 233,000 training samples, 112,000 test samples and 40 features. Some of the important features are Perform CNS Score (bureau score), Disbursed Amount (total amount disbursed for all loans at the time of disbursement), ltv (loan-to-value of the asset), Current pincode (current pincode of the customer), Primary Sanctioned Amount (total amount sanctioned for all loans at the time of disbursement) and Primary overdue accounts (count of default accounts at the time of disbursement).

The main goal of the project is to identify loan defaulters. The project has three major parts:

1. Preliminary analysis of the data
2. Using data mining techniques for prediction
3. Improving the predictions using advanced techniques

METHODOLOGY
The following are the main implementation steps in the project:

1. Exploratory data analysis
2. Data pre-processing
3. Classification using data mining algorithms
4. Using advanced techniques

Exploratory Data Analysis
Exploratory data analysis was performed in two stages:

1. Analysing the features of the data set
2. Finding the most and least important features in the data set

Analysis of some Features
First, we analyzed flag features such as the Aadhaar flag, Voter flag and PAN flag.

An Aadhaar card carries a unique identification number issued to every citizen in India. The data set has a flag that tells whether a client has an Aadhaar card. Analyzing the Aadhaar flag, we found that, in the training set, more people have an Aadhaar card than not. Analyzing it against loan default, we find that among the people who have an Aadhaar card about 21 percent default, whereas among the people who do not about 27 percent default.

This pattern is flipped for voter ID. Analyzing the Voter flag, we find that, in the training set, more people lack a voter ID than have one. Against loan default, among the people who have a voter ID about 25 percent default, whereas among the people who do not around 20 percent default.

We further analyzed the PAN flag. A Permanent Account Number (PAN) is a ten-character alphanumeric identifier issued in the form of a laminated "PAN card" by the Indian Income Tax Department. The PAN flag in this data set tells whether a person has a PAN card. Analyzing the PAN flag, we find that, in the training set, more people lack a PAN card than have one. Against loan default, among the people who have a PAN card about 23.5 percent default, whereas among the people who do not around 20 percent default.
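As a concrete illustration, the per-flag default rates above can be reproduced with a short pandas computation. This is a minimal sketch: the column names (Aadhar_flag, VoterID_flag, PAN_flag, loan_default) are assumptions based on the Kaggle data dictionary, and the file path is hypothetical.

```python
import pandas as pd

# Load the training split of the LT Vehicle Loan Default Prediction data set.
train = pd.read_csv("train.csv")

# Column names are assumed from the Kaggle data dictionary; adjust if they differ.
for flag in ["Aadhar_flag", "VoterID_flag", "PAN_flag"]:
    counts = train[flag].value_counts()
    # Mean of the 0/1 label within each flag group = fraction of defaulters.
    default_rate = train.groupby(flag)["loan_default"].mean() * 100
    print(f"{flag}: holders={counts.get(1, 0)}, non-holders={counts.get(0, 0)}")
    print(f"  default rate with card:    {default_rate.get(1, float('nan')):.1f}%")
    print(f"  default rate without card: {default_rate.get(0, float('nan')):.1f}%")
```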
Feature Importance
Next, we analyzed the importance of the features for predicting loan default and found the following (a sketch of one way to compute such a ranking follows the list):

1. The top 4 most important features are CNS score, Disbursed amount, LTV and Zip code.
2. The top 3 least important features are DOB, New accounts in last 6 months and Secondary accounts.
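One common way to obtain such a ranking, shown here as a sketch under the assumption of a numeric feature frame and a loan_default label column, is the impurity-based feature importances of a fitted random forest.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("train.csv")
# Keep only numeric columns for this quick ranking; full pre-processing comes later.
numeric = train.select_dtypes("number").dropna()
X = numeric.drop(columns=["loan_default"])
y = numeric["loan_default"]

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Impurity-based importances, sorted from most to least important.
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances.head(4))   # most important features
print(importances.tail(3))   # least important features
```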
Data pre-processing
Data pre-processing is an important step before applying data mining techniques. The following processing was done on our data set.

Dropped columns
We dropped columns in which every row is unique or every row has the same value, since such features add no value to the prediction: UniqueID, Employee Code ID, Perform cns score description and Mobile No avl flag.

Date to months
Converted dates to a number of months for the features Average account age and Credit history length.

Date to quarter
Extracted the quarter from the date feature Disbursal date.

Categorized Date of Birth
Bucketed the feature Date of Birth into 5-year bins.

One hot encoding
One-hot encoded some features, such as Disbursal date and Date of Birth.

Scaling
Scaled the features when required. Certain algorithms, such as k-nearest neighbours, depend on distances, so they require scaling and predict better on scaled data. We used the standard scaler and the min-max scaler.
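A minimal sketch of these pre-processing steps with pandas and scikit-learn. All column names (UniqueID, Employee_code_ID, AVERAGE.ACCT.AGE, CREDIT.HISTORY.LENGTH, DisbursalDate, Date.of.Birth, etc.) and the "1yrs 11mon" duration format are assumptions based on the Kaggle data dictionary, not verified here.

```python
import re
import pandas as pd
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("train.csv")

# Drop identifier-like or constant columns (names assumed from the data dictionary).
train = train.drop(columns=["UniqueID", "Employee_code_ID",
                            "PERFORM_CNS.SCORE.DESCRIPTION", "MobileNo_Avl_Flag"])

# Convert duration strings such as "1yrs 11mon" into a number of months.
def to_months(s):
    years, months = map(int, re.findall(r"\d+", s))
    return 12 * years + months

for col in ["AVERAGE.ACCT.AGE", "CREDIT.HISTORY.LENGTH"]:
    train[col] = train[col].apply(to_months)

# Extract the quarter from the disbursal date.
train["DisbursalQuarter"] = pd.to_datetime(train["DisbursalDate"],
                                           dayfirst=True).dt.quarter

# Bucket dates of birth into 5-year bins.
birth_year = pd.to_datetime(train["Date.of.Birth"], dayfirst=True).dt.year
train["DOB_bucket"] = (birth_year // 5) * 5

# One-hot encode the categorical columns derived above.
train = pd.get_dummies(train, columns=["DisbursalQuarter", "DOB_bucket"])

# Scale numeric features; distance-based methods such as KNN need this.
numeric_cols = ["disbursed_amount", "ltv", "PERFORM_CNS.SCORE"]
train[numeric_cols] = StandardScaler().fit_transform(train[numeric_cols])
```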
Classification using ML algorithms
Classification is the process of assigning data points to predefined classes or categories. In this project we implemented seven classification algorithms and a neural network, which are described in the following sections.

Naive Bayes Classifier
Bayesian classifiers are statistical classifiers that predict class membership probabilities. They use Bayes' theorem to calculate the posterior probability. Naive Bayes is a simple classifier that assumes the attributes are conditionally independent of each other, which simplifies the calculations.

Logistic Regression
Logistic Regression is a statistical learning technique. It is one of the supervised machine learning methods used for classification tasks.

K Nearest Neighbour Classifier
The nearest neighbour classification algorithm classifies a new data point based on the majority vote of its nearest neighbours in the training data set.
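A sketch of training and scoring this first group of classifiers with scikit-learn, using accuracy, precision and recall as in our evaluation. The feature and label extraction assumes the pre-processed frame from the earlier sketch; the hyperparameters are illustrative, not our tuned values.

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Features and labels from the pre-processed training frame (previous sketch).
X = train.drop(columns=["loan_default"]).select_dtypes("number").fillna(0)
y = train["loan_default"]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

models = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    print(name,
          f"acc={accuracy_score(y_val, pred):.2%}",
          f"prec={precision_score(y_val, pred):.2%}",
          f"rec={recall_score(y_val, pred):.2%}")
```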
SVM
SVM, short for Support Vector Machine, is another statistical learning technique. It can solve both linear and non-linear problems because, in addition to a linear kernel, it offers non-linear kernels such as RBF and polynomial. The idea is to create a line or hyperplane that separates the data into classes while maximizing the margin around it. It also has a penalty parameter that decides how strongly wrongly classified samples are penalized.

Decision tree
A Decision Tree is a flowchart-like tree structure that divides the data into subgroups based on conditions, in order to classify it. Each condition is selected so that the resulting classification is as pure as possible: at each node of the tree, a decision is made about how to split the data to obtain the purest child nodes. To decide which attribute to split on, we can use measures such as Gini impurity, entropy or misclassification error. Travelling down the tree, we finally reach leaf nodes that give the label for a particular sample.
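To make the split measures concrete, here is a minimal sketch of the Gini impurity and entropy of a set of labels; a split is chosen to minimize the weighted impurity of the child nodes it creates.

```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum_k p_k^2, which is zero for a pure node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy: -sum_k p_k * log2(p_k), also zero for a pure node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(gini([0, 0, 1, 1]))     # 0.5, a maximally impure binary node
print(entropy([0, 0, 0, 0]))  # 0.0, a pure node
```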
XGboost
In boosting, we train multiple models and use weighted voting to classify the data. At every iteration we train a model on the samples and check which ones are wrongly classified. The weights of those samples are increased for the next iteration and a new model is trained. After training ends, all the weak learners are combined: a sample is classified by running it through all the models, whose predictions are combined by weighted voting. XGBoost [8] is a scalable implementation of gradient tree boosting.

Random Forest
For a random forest, we select random subsets of features when searching for the best split attribute. We build multiple such trees and classify the data with a majority-vote classifier.
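A sketch of the two tree ensembles using the xgboost and scikit-learn APIs; the hyperparameters shown are illustrative defaults, not our tuned settings, and X_train/X_val come from the earlier split.

```python
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier

# Boosted trees: each new tree focuses on the errors of the current ensemble.
xgb = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
xgb.fit(X_train, y_train)

# Random forest: many de-correlated trees, combined by majority vote.
# max_features="sqrt" draws a random feature subset at every split.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            random_state=42)
rf.fit(X_train, y_train)

print(xgb.score(X_val, y_val), rf.score(X_val, y_val))  # validation accuracy
```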
Neural Networks
Neural network algorithms are loosely inspired by the brain. They are multi-layer networks of neurons used for classification. As a basic illustration, such a network might have 6 inputs, a hidden layer containing 4 neurons and an output layer.
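As an illustration (not our tuned architecture), a network with one hidden layer of 4 neurons, echoing the small example shape above, can be expressed directly with scikit-learn's MLPClassifier; the input width adapts automatically to the data.

```python
from sklearn.neural_network import MLPClassifier

# One hidden layer of 4 neurons, matching the basic example described above.
mlp = MLPClassifier(hidden_layer_sizes=(4,), activation="relu",
                    max_iter=500, random_state=42)
mlp.fit(X_train, y_train)
print(mlp.predict(X_val[:5]))  # predicted class labels for five samples
```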
Using advanced techniques

SMOTE
SMOTE is short for Synthetic Minority Oversampling Technique. It works by finding the k nearest neighbours of each minority-class observation (i.e., similar observations), randomly choosing one of those neighbours, and using it to create a similar, but randomly perturbed, new observation.
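A sketch of SMOTE with the imbalanced-learn library; as discussed in the experiments section, it is fit on the training split only so that no synthetic samples leak into the evaluation data.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

# Oversample the minority class (defaulters) in the training split only.
smote = SMOTE(k_neighbors=5, random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)

print("before:", Counter(y_train))      # imbalanced class counts
print("after: ", Counter(y_train_bal))  # roughly balanced counts
```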

PCA
PCA, short for Principal Component Analysis, is a statistical procedure useful for dimensionality reduction, that is, for reducing the number of features. It transforms the input features so that we can drop the less important transformed features while still retaining the valuable parts of the original features.
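A sketch of inspecting explained variance with scikit-learn's PCA, along the lines of the component analysis described in the experiments section; the component count is illustrative and the input is the balanced, scaled training matrix from the previous sketch.

```python
import numpy as np
from sklearn.decomposition import PCA

# Fit PCA on the (scaled) training features and inspect how much variance
# each additional component explains.
pca = PCA(n_components=10, random_state=42)
X_train_pca = pca.fit_transform(X_train_bal)

cumulative = np.cumsum(pca.explained_variance_ratio_)
for i, v in enumerate(cumulative, start=1):
    print(f"{i} components -> {v:.1%} of variance")
```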
EXPERIMENTS
1. Preliminary Results
After data pre-processing, we performed classification using K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Naive Bayes, Decision Trees, Random Forest, XGBoost and Logistic Regression. The following are the initial results of running these classification techniques on our data set.

Algorithm             Accuracy   Precision   Recall
Naive Bayes           77.47      36.5        4.65
Logistic Regression   78.17      40.48       0.49
KNN                   74.55      29.07       11.72
SVM                   78.0       28.9        12.3
XGBoost               78.22      53.1        7.3
Random Forest         78.22      50.13       1.12
Decision Tree         77.14      31.5        4.25

2. Improvements
From the preliminary results, we can see that the precision is really low, because one class of records is not being classified properly.

Data Imbalance
We found that this is caused by data imbalance in the data set: the loan-default class label is imbalanced. To overcome this and to improve our precision (which is important because we have to correctly classify loan defaulters as defaulters, otherwise the company will incur losses), we used the advanced techniques described below.

SMOTE
We fixed the imbalance on the training data only. This ensures that the generated data does not bleed into the testing data, so the results we obtain can be generalised.

PCA
We applied PCA to our data set when performing classification with SVM. This is because SVM takes a long time given the size of our data set and its roughly 40 features, and PCA is very useful in such a case. Examining how the explained variance changes as the number of components increases, we see that after 5 components there is not much change in the variance percentage. We used the top 7 significant components returned by PCA to perform the prediction, which decreases the time taken to train the model.
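Putting the improvements together, a hedged sketch of the SMOTE + PCA + SVM combination as an imbalanced-learn pipeline; the sampler step runs only during fitting, so the evaluation data stay untouched by SMOTE.

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# The SMOTE step is applied only on calls to fit(); validation and test data
# pass through PCA and the SVM without resampling.
pipeline = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("pca", PCA(n_components=7)),      # top 7 components, as in our experiments
    ("svm", SVC(kernel="rbf", C=1.0)), # illustrative settings, not tuned values
])
pipeline.fit(X_train, y_train)
print(pipeline.score(X_val, y_val))    # accuracy on held-out data
```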
3. Current Results
The following are the results obtained after applying SMOTE [5] and PCA [6] and then performing the prediction. First we display the confusion matrix for each classification technique (rows are actual classes, columns are predicted classes; 1 = defaulter), and then we present the accuracy, precision and recall in tabular format.

Naive Bayes
        0       1
0   17124    9162
1    4248    8405

Logistic Regression
        0       1
0   20238    6048
1    5162    7491

Random Forest
        0       1
0   25418     868
1    2779    9874

KNN
        0       1
0   19802    6484
1    4383    8270
Decision Tree
        0       1
0   25301     985
1     911   11742

SVM
        0       1
0   23098    3188
1    4966    7687

Gradient Boosting
        0       1
0   24653    1633
1    1688   10965

Neural Networks
        0       1
0   25628     658
1    1972   10681
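The reported accuracy and precision follow directly from these matrices; for example, the Decision Tree matrix above reproduces the 95.13 percent accuracy quoted in the conclusion.

```python
# Decision Tree confusion matrix from above (rows actual, columns predicted).
tn, fp = 25301, 985
fn, tp = 911, 11742

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision = tp / (tp + fp)  # fraction of predicted defaulters that truly default
recall = tp / (tp + fn)     # fraction of true defaulters that we catch

print(f"accuracy={accuracy:.2%} precision={precision:.2%} recall={recall:.2%}")
# accuracy=95.13% precision=92.26% recall=92.80%
```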
Final Results summarized
The results above summarize all the algorithms run so far. We can see that Neural Networks give the best prediction, with a high precision of 94.2, and that the Decision Tree is comparable, with a high accuracy of 95.13.

CONCLUSION
We have significantly improved the accuracy and precision of predicting a loan defaulter. Neural Networks gave the best prediction, with a high precision of 94.2, while the Decision Tree was comparable, with a high accuracy of 95.13. We have some ideas for further improvement, which could be future work for the project: the accuracy can be improved further by creating an ensemble, data and feature engineering techniques can also be applied, and the neural network can be fine-tuned further while more models are explored.

References
[1] LT Vehicle Loan Default Prediction. Kaggle, 2019. https://www.kaggle.com/mamtadhaker/lt-vehicle-loan-default-prediction
[2] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 1967.
[3] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 1998.
[4] S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129-137, 1982.
[5] N. V. Chawla, K. W. Bowyer, et al. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, January 2002.
[6] H. Abdi and L. J. Williams. Principal component analysis. WIREs Computational Statistics, 2010.
[7] Aadhaar. https://uidai.gov.in/
[8] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), pages 785-794.
[10] L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.
