
Vehicle Loan Default Detection

Akhil Sai Peddireddy, Akhila, Sanatkumar, Goutham
Charlottesville, USA

ABSTRACT
The increased vehicle loan rejection rates at financial institutions are a direct result of the rise in vehicle loan defaults, which causes significant losses for the institutions. Our goal is to identify the clients capable of repayment so that their loans are not rejected. The financial institution also benefits from knowing which clients are likely to default on a vehicle loan. To address this issue, we implement a set of data mining algorithms that classify each loanee as a defaulter or a non-defaulter.

This project uses the "LT Vehicle Loan Default Prediction" data set from Kaggle [1]. The data set provides various information about the loan and the loanee, such as loanee information (demographic data like age, identity proof, etc.), loan information (disbursal details, loan-to-value ratio, etc.) and bureau data history (bureau score, number of active accounts, status of other loans, credit history, etc.). On this data set, we performed classification using K-Nearest Neighbors (KNN) [2], Support Vector Machine (SVM) [3], Naive Bayes, Decision Trees, Random Forest [10], XGBoost [8] and Logistic Regression. The data set contains ground-truth labels, so we can determine which classification algorithms perform better on this type of data. We evaluate the models using accuracy, recall and precision.

INTRODUCTION
An increase in vehicle loan defaults leads to significant losses for financial institutions, which in turn drives up loan rejection rates. To identify the clients capable of repayment so that their loans are not rejected, we use data mining techniques to improve the efficiency of the vehicle loan approval process. This has major advantages, such as increasing customer satisfaction and reducing bad loans.

The data set being used comes from a financial institution named LT: the "LT Vehicle Loan Default Prediction" data set from Kaggle. It has about 233,000 training samples, 112,000 test samples and 40 features. Some of the important features are Perform CNS Score (bureau score), Disbursed Amount (total amount disbursed for all loans at the time of disbursement), ltv (loan-to-value of the asset), Current pincode (current pincode of the customer), Primary Sanctioned Amount (total amount sanctioned for all loans at the time of disbursement) and Primary overdue accounts (count of default accounts at the time of disbursement).

The main goal of the project is to identify loan defaulters. The project has three major parts:

1. Preliminary analysis of the data
2. Using data mining techniques for prediction
3. Improving the predictions using advanced techniques

METHODOLOGY
The following are the main implementation steps in the project:

1. Exploratory data analysis
2. Data pre-processing
3. Classification using data mining algorithms
4. Using advanced techniques

Exploratory Data Analysis
Exploratory data analysis was performed in two stages:

1. Analysing the features of the data set
2. Finding the most and least important features in the data set

Analysis of some Features
First, we analyzed flag features such as the Aadhaar flag, Voter flag and PAN flag.

An Aadhaar card carries a unique identification number issued to every citizen in India. The data set has a flag that tells whether a client has an Aadhaar card. Analyzing the Aadhaar flag, we found that, in the training set, more people have an Aadhaar card than not. Analyzing it against loan default, we find that among the people who have an Aadhaar card about 21 percent default, whereas among the people who do not about 27 percent default.

This pattern is flipped for voter ID. Analyzing the Voter flag, we find that, in the training set, more people lack a voter ID than have one. Against loan default, among the people who have a voter ID about 25 percent default, whereas among the people who do not around 20 percent default.

We further analyzed the PAN flag. A Permanent Account Number (PAN) is a ten-character alphanumeric identifier issued in the form of a laminated "PAN card" by the Indian Income Tax Department. The PAN flag in this data set tells whether a person has a PAN card. Analyzing the PAN flag, we find that, in the training set, more people lack a PAN card than have one. Against loan default, among the people who have a PAN card about 23.5 percent default, whereas among the people who do not around 20 percent default.
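As a concrete illustration, the per-flag default rates above can be reproduced with a short pandas computation. This is a minimal sketch: the column names (Aadhar_flag, VoterID_flag, PAN_flag, loan_default) are assumptions based on the Kaggle data dictionary, and the file path is hypothetical.

```python
import pandas as pd

# Load the training split of the LT Vehicle Loan Default Prediction data set.
train = pd.read_csv("train.csv")

# Column names are assumed from the Kaggle data dictionary; adjust if they differ.
for flag in ["Aadhar_flag", "VoterID_flag", "PAN_flag"]:
    counts = train[flag].value_counts()
    # Mean of the 0/1 label within each flag group = fraction of defaulters.
    default_rate = train.groupby(flag)["loan_default"].mean() * 100
    print(f"{flag}: holders={counts.get(1, 0)}, non-holders={counts.get(0, 0)}")
    print(f"  default rate with card:    {default_rate.get(1, float('nan')):.1f}%")
    print(f"  default rate without card: {default_rate.get(0, float('nan')):.1f}%")
```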
Feature Importance
Next, we analyzed the importance of the features for predicting loan default and found the following (a sketch of one way to compute such a ranking follows the list):

1. The top 4 most important features are CNS score, Disbursed amount, LTV and Zip code.
2. The top 3 least important features are DOB, New accounts in last 6 months and Secondary accounts.
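One common way to obtain such a ranking, shown here as a sketch under the assumption of a numeric feature frame and a loan_default label column, is the impurity-based feature importances of a fitted random forest.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("train.csv")
# Keep only numeric columns for this quick ranking; full pre-processing comes later.
numeric = train.select_dtypes("number").dropna()
X = numeric.drop(columns=["loan_default"])
y = numeric["loan_default"]

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Impurity-based importances, sorted from most to least important.
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances.head(4))   # most important features
print(importances.tail(3))   # least important features
```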
Data pre-processing
Data pre-processing is an important step before applying data mining techniques. The following processing was done on our data set.

Dropped columns
We dropped columns in which every row is unique or every row has the same value, since such features add no value to the prediction: UniqueID, Employee Code ID, Perform cns score description and Mobile No avl flag.

Date to months
Converted dates to a number of months for the features Average account age and Credit history length.

Date to quarter
Extracted the quarter from the date feature Disbursal date.

Categorized Date of Birth
Bucketed the feature Date of Birth into 5-year bins.

One hot encoding
One-hot encoded some features, such as Disbursal date and Date of Birth.

Scaling
Scaled the features when required. Certain algorithms, such as k-nearest neighbours, depend on distances, so they require scaling and predict better on scaled data. We used the standard scaler and the min-max scaler.
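A minimal sketch of these pre-processing steps with pandas and scikit-learn. All column names (UniqueID, Employee_code_ID, AVERAGE.ACCT.AGE, CREDIT.HISTORY.LENGTH, DisbursalDate, Date.of.Birth, etc.) and the "1yrs 11mon" duration format are assumptions based on the Kaggle data dictionary, not verified here.

```python
import re
import pandas as pd
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("train.csv")

# Drop identifier-like or constant columns (names assumed from the data dictionary).
train = train.drop(columns=["UniqueID", "Employee_code_ID",
                            "PERFORM_CNS.SCORE.DESCRIPTION", "MobileNo_Avl_Flag"])

# Convert duration strings such as "1yrs 11mon" into a number of months.
def to_months(s):
    years, months = map(int, re.findall(r"\d+", s))
    return 12 * years + months

for col in ["AVERAGE.ACCT.AGE", "CREDIT.HISTORY.LENGTH"]:
    train[col] = train[col].apply(to_months)

# Extract the quarter from the disbursal date.
train["DisbursalQuarter"] = pd.to_datetime(train["DisbursalDate"],
                                           dayfirst=True).dt.quarter

# Bucket dates of birth into 5-year bins.
birth_year = pd.to_datetime(train["Date.of.Birth"], dayfirst=True).dt.year
train["DOB_bucket"] = (birth_year // 5) * 5

# One-hot encode the categorical columns derived above.
train = pd.get_dummies(train, columns=["DisbursalQuarter", "DOB_bucket"])

# Scale numeric features; distance-based methods such as KNN need this.
numeric_cols = ["disbursed_amount", "ltv", "PERFORM_CNS.SCORE"]
train[numeric_cols] = StandardScaler().fit_transform(train[numeric_cols])
```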
Classification using ML algorithms
Classification is the process of assigning data points to predefined classes or categories. In this project we implemented seven classification algorithms and a neural network, which are described in the following sections.

Naive Bayes Classifier
Bayesian classifiers are statistical classifiers that predict class membership probabilities. They use Bayes' theorem to calculate the posterior probability. Naive Bayes is a simple classifier that assumes the attributes are conditionally independent of each other, which simplifies the calculations.

Logistic Regression
Logistic Regression is a statistical learning technique. It is one of the supervised machine learning methods used for classification tasks.

K Nearest Neighbour Classifier
The nearest neighbour classification algorithm classifies a new data point based on the majority vote of its nearest neighbours in the training data set.
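A sketch of training and scoring this first group of classifiers with scikit-learn, using accuracy, precision and recall as in our evaluation. The feature and label extraction assumes the pre-processed frame from the earlier sketch; the hyperparameters are illustrative, not our tuned values.

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Features and labels from the pre-processed training frame (previous sketch).
X = train.drop(columns=["loan_default"]).select_dtypes("number").fillna(0)
y = train["loan_default"]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

models = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    print(name,
          f"acc={accuracy_score(y_val, pred):.2%}",
          f"prec={precision_score(y_val, pred):.2%}",
          f"rec={recall_score(y_val, pred):.2%}")
```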
SVM
SVM, short for Support Vector Machine, is another statistical learning technique. It can solve both linear and non-linear problems because, in addition to a linear kernel, it offers non-linear kernels such as RBF and polynomial. The idea is to create a line or hyperplane that separates the data into classes while maximizing the margin around it. It also has a penalty parameter that decides how strongly wrongly classified samples are penalized.

Decision tree
A Decision Tree is a flowchart-like tree structure that divides the data into subgroups based on conditions, in order to classify it. Each condition is selected so that the resulting classification is as pure as possible: at each node of the tree, a decision is made about how to split the data to obtain the purest child nodes. To decide which attribute to split on, we can use measures such as Gini impurity, entropy or misclassification error. Travelling down the tree, we finally reach leaf nodes that give the label for a particular sample.
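To make the split measures concrete, here is a minimal sketch of the Gini impurity and entropy of a set of labels; a split is chosen to minimize the weighted impurity of the child nodes it creates.

```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum_k p_k^2, which is zero for a pure node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy: -sum_k p_k * log2(p_k), also zero for a pure node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(gini([0, 0, 1, 1]))     # 0.5, a maximally impure binary node
print(entropy([0, 0, 0, 0]))  # 0.0, a pure node
```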
XGboost
In boosting, we train multiple models and use weighted voting to classify the data. At every iteration we train a model on the samples and check which ones are wrongly classified. The weights of those samples are increased for the next iteration and a new model is trained. After training ends, all the weak learners are combined: a sample is classified by running it through all the models, whose predictions are combined by weighted voting. XGBoost [8] is a scalable implementation of gradient tree boosting.

Random Forest
For a random forest, we select random subsets of features when searching for the best split attribute. We build multiple such trees and classify the data with a majority-vote classifier.
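A sketch of the two tree ensembles using the xgboost and scikit-learn APIs; the hyperparameters shown are illustrative defaults, not our tuned settings, and X_train/X_val come from the earlier split.

```python
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier

# Boosted trees: each new tree focuses on the errors of the current ensemble.
xgb = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
xgb.fit(X_train, y_train)

# Random forest: many de-correlated trees, combined by majority vote.
# max_features="sqrt" draws a random feature subset at every split.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            random_state=42)
rf.fit(X_train, y_train)

print(xgb.score(X_val, y_val), rf.score(X_val, y_val))  # validation accuracy
```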
Neural Networks
Neural network algorithms are loosely inspired by the brain. They are multi-layer networks of neurons used for classification. As a basic illustration, such a network might have 6 inputs, a hidden layer containing 4 neurons and an output layer.
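As an illustration (not our tuned architecture), a network with one hidden layer of 4 neurons, echoing the small example shape above, can be expressed directly with scikit-learn's MLPClassifier; the input width adapts automatically to the data.

```python
from sklearn.neural_network import MLPClassifier

# One hidden layer of 4 neurons, matching the basic example described above.
mlp = MLPClassifier(hidden_layer_sizes=(4,), activation="relu",
                    max_iter=500, random_state=42)
mlp.fit(X_train, y_train)
print(mlp.predict(X_val[:5]))  # predicted class labels for five samples
```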
Using advanced techniques

SMOTE
SMOTE is short for Synthetic Minority Oversampling Technique. It works by finding the k nearest neighbours of each minority-class observation (i.e., similar observations), randomly choosing one of those neighbours, and using it to create a similar, but randomly perturbed, new observation.
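A sketch of SMOTE with the imbalanced-learn library; as discussed in the experiments section, it is fit on the training split only so that no synthetic samples leak into the evaluation data.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

# Oversample the minority class (defaulters) in the training split only.
smote = SMOTE(k_neighbors=5, random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)

print("before:", Counter(y_train))      # imbalanced class counts
print("after: ", Counter(y_train_bal))  # roughly balanced counts
```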

PCA
PCA, short for Principal Component Analysis, is a statistical procedure useful for dimensionality reduction, that is, for reducing the number of features. It transforms the input features so that we can drop the less important transformed features while still retaining the valuable parts of the original features.
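A sketch of inspecting explained variance with scikit-learn's PCA, along the lines of the component analysis described in the experiments section; the component count is illustrative and the input is the balanced, scaled training matrix from the previous sketch.

```python
import numpy as np
from sklearn.decomposition import PCA

# Fit PCA on the (scaled) training features and inspect how much variance
# each additional component explains.
pca = PCA(n_components=10, random_state=42)
X_train_pca = pca.fit_transform(X_train_bal)

cumulative = np.cumsum(pca.explained_variance_ratio_)
for i, v in enumerate(cumulative, start=1):
    print(f"{i} components -> {v:.1%} of variance")
```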
EXPERIMENTS
1. Preliminary Results
After data pre-processing, we performed classification using K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Naive Bayes, Decision Trees, Random Forest, XGBoost and Logistic Regression. The following are the initial results of running these classification techniques on our data set.

Algorithm             Accuracy   Precision   Recall
Naive Bayes           77.47      36.5        4.65
Logistic Regression   78.17      40.48       0.49
KNN                   74.55      29.07       11.72
SVM                   78.0       28.9        12.3
XGBoost               78.22      53.1        7.3
Random Forest         78.22      50.13       1.12
Decision Tree         77.14      31.5        4.25

2. Improvements
From the preliminary results, we can see that the precision is really low, because one class of records is not being classified properly.

Data Imbalance
We found that this is caused by data imbalance in the data set: the loan-default class label is imbalanced. To overcome this and to improve our precision (which is important because we have to correctly classify loan defaulters as defaulters, otherwise the company will incur losses), we used the advanced techniques described below.

SMOTE
We fixed the imbalance on the training data only. This ensures that the generated data does not bleed into the testing data, so the results we obtain can be generalised.

PCA
We applied PCA to our data set when performing classification with SVM. This is because SVM takes a long time given the size of our data set and its roughly 40 features, and PCA is very useful in such a case. Examining how the explained variance changes as the number of components increases, we see that after 5 components there is not much change in the variance percentage. We used the top 7 significant components returned by PCA to perform the prediction, which decreases the time taken to train the model.
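Putting the improvements together, a hedged sketch of the SMOTE + PCA + SVM combination as an imbalanced-learn pipeline; the sampler step runs only during fitting, so the evaluation data stay untouched by SMOTE.

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# The SMOTE step is applied only on calls to fit(); validation and test data
# pass through PCA and the SVM without resampling.
pipeline = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("pca", PCA(n_components=7)),      # top 7 components, as in our experiments
    ("svm", SVC(kernel="rbf", C=1.0)), # illustrative settings, not tuned values
])
pipeline.fit(X_train, y_train)
print(pipeline.score(X_val, y_val))    # accuracy on held-out data
```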
3. Current Results
The following are the results obtained after applying SMOTE [5] and PCA [6] and then performing the prediction. First we display the confusion matrix for each classification technique (rows are actual classes, columns are predicted classes; 1 = defaulter), and then we present the accuracy, precision and recall in tabular format.

Naive Bayes
        0       1
0   17124    9162
1    4248    8405

Logistic Regression
        0       1
0   20238    6048
1    5162    7491

Random Forest
        0       1
0   25418     868
1    2779    9874

KNN
        0       1
0   19802    6484
1    4383    8270
Decision Tree
        0       1
0   25301     985
1     911   11742

SVM
        0       1
0   23098    3188
1    4966    7687

Gradient Boosting
        0       1
0   24653    1633
1    1688   10965

Neural Networks
        0       1
0   25628     658
1    1972   10681
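The reported accuracy and precision follow directly from these matrices; for example, the Decision Tree matrix above reproduces the 95.13 percent accuracy quoted in the conclusion.

```python
# Decision Tree confusion matrix from above (rows actual, columns predicted).
tn, fp = 25301, 985
fn, tp = 911, 11742

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision = tp / (tp + fp)  # fraction of predicted defaulters that truly default
recall = tp / (tp + fn)     # fraction of true defaulters that we catch

print(f"accuracy={accuracy:.2%} precision={precision:.2%} recall={recall:.2%}")
# accuracy=95.13% precision=92.26% recall=92.80%
```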
Final Results summarized
The results above summarize all the algorithms run so far. We can see that Neural Networks give the best prediction, with a high precision of 94.2, and that the Decision Tree is comparable, with a high accuracy of 95.13.

CONCLUSION
We have significantly improved the accuracy and precision of predicting a loan defaulter. Neural Networks gave the best prediction, with a high precision of 94.2, while the Decision Tree was comparable, with a high accuracy of 95.13. We have some ideas for further improvement, which could be future work for the project: the accuracy can be improved further by creating an ensemble, data and feature engineering techniques can also be applied, and the neural network can be fine-tuned further while more models are explored.

References
[1] LT Vehicle Loan Default Prediction. Kaggle, 2019. https://www.kaggle.com/mamtadhaker/lt-vehicle-loan-default-prediction
[2] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 1967.
[3] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 1998.
[4] S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129-137, 1982.
[5] N. V. Chawla, K. W. Bowyer, et al. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, January 2002.
[6] H. Abdi and L. J. Williams. Principal component analysis. WIREs Computational Statistics, 2010.
[7] Aadhaar. https://uidai.gov.in/
[8] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), pages 785-794.
[10] L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.
