PA v0.21
My Goal
Given historical data on past loans, including whether or not the borrower defaulted (charge-off), I have built
a model that predicts whether or not a borrower will pay back their loan. This way, when the company gets a
new potential customer in the future, we can assess whether or not they are likely to pay back the loan.
Models Used
1. Random Forest
2. Decision Tree
3. Neural Network
Language/Analytics Tools Used
1. Python – Jupyter Notebook
Modules used
1. Pandas
2. NumPy
3. Matplotlib
Data Set Overview: copy from
https://github.com/vishrut18/Data-Science-and-ML-Projects/blob/master/1.%20LendingClub%20Loan_Status%20Predictive%20model%20using%20Decision%20Tress%20and%20Random%20Forests.ipynb
In the form of table
27 columns
Mention it is categorical/Numerical/Ordinal in 1 column
2 files – 1st for data and 2nd for field description
OVERALL GOAL: Understand which variables are important, view summary statistics, and
visualize the data
Ratio: XX:YY
Let's explore the Grade and SubGrade columns that LendingClub attributes to the loans.
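A minimal sketch of one way to explore these two columns with pandas. The toy rows, the loan_status column name, and its 'Fully Paid' / 'Charged Off' labels are assumptions made for illustration, not values taken from the data set.

```python
import pandas as pd

# Toy rows; loan_status and its labels are assumed names, not from the source.
df = pd.DataFrame({
    "grade": ["A", "A", "B", "B", "C", "C"],
    "sub_grade": ["A1", "A2", "B1", "B1", "C3", "C3"],
    "loan_status": ["Fully Paid", "Fully Paid", "Fully Paid",
                    "Charged Off", "Charged Off", "Charged Off"],
})

# Number of loans in each grade, in alphabetical order.
grade_counts = df["grade"].value_counts().sort_index()

# Share of charged-off loans within each sub grade.
charge_off_rate = (
    df["loan_status"].eq("Charged Off").groupby(df["sub_grade"]).mean()
)

print(grade_counts)
print(charge_off_rate)
```

On the real data, the same two summaries could be passed to Matplotlib as bar charts to compare grades visually.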
Looks like the total_acc feature correlates with mort_acc, and this makes sense! So I'll use this fillna()
approach. Let's group the dataframe by total_acc and calculate the mean value of mort_acc per total_acc entry.
Let's fill in the missing mort_acc values based on their total_acc value. If mort_acc is missing, then fill in
that missing value with the mean value corresponding to its total_acc value from the Series of means computed above.
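The group-mean fill described above can be sketched as follows. The toy values and the name total_acc_avg are assumptions for illustration. The sketch uses map() together with fillna(), which is one of several equivalent ways to do this.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the loan table; the values are made up.
df = pd.DataFrame({
    "total_acc": [10, 10, 10, 20, 20],
    "mort_acc": [1.0, 3.0, np.nan, 4.0, np.nan],
})

# Mean mort_acc for each total_acc value (missing entries are skipped).
total_acc_avg = df.groupby("total_acc")["mort_acc"].mean()

# Look up each row's group mean, then use it only where mort_acc is missing.
df["mort_acc"] = df["mort_acc"].fillna(df["total_acc"].map(total_acc_avg))

print(df)
```

If a total_acc value has no observed mort_acc at all, its group mean is itself missing and those rows stay unfilled, so a second pass may be needed on real data.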
term
  Description: The number of payments on the loan. Values are in months and can be either 36 or 60.
  Values: 36 months: 230928; 60 months: 72147
  Action: Convert the term feature into an integer numeric data type, either 36 or 60.

grade
  Description: LendingClub-assigned loan grade.
  Values: Grades are assigned as {A, B, C, D, E, F, G}.
  Action: grade is already part of sub_grade, so drop the grade feature.
  Reason: The information is already available in another feature, so it can be dropped.

sub_grade
  Description: LendingClub-assigned sub grade for the loan.
  Values: Sub grades are assigned as A1, A2, A3, A4, A5 through G5.
  Action: Convert sub_grade into dummy variables and concatenate the new columns to the original data frame.

home_ownership
  Description: The home ownership status provided by the borrower during registration or obtained from the credit report.
  Values: 'MORTGAGE', 'RENT', 'OWN', 'NONE', 'ANY', 'OTHER'
  Action: Replace 'NONE' and 'ANY' with 'OTHER', convert to dummy variables, and concatenate them with the original data frame.
  Reason: This reduces the categories to 4.

address
  Description: State provided by the borrower in the loan application.
  Values: Contains the complete address, including the zip code.
  Action: Turn the zip_code column taken from the address into dummy variables, concatenate the result, and drop both the original zip_code column and the address column.

issue_d
  Description: The month in which the loan was funded.
  Action: Drop this column. Keeping it would be data leakage, because when the model is used we won't know beforehand whether the loan was issued.
  Reason: At prediction time it is not yet known whether the loan will be issued.

earliest_cr_line
  Description: The month the borrower's earliest reported credit line was opened.
  Action: Extract the year from this feature and convert it into a numeric feature.
  Reason: It is a historical time stamp, and the year can be treated as continuous, so it need not be converted into dummy variables.
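The preprocessing steps listed in the table can be sketched in pandas as below. This is an illustration under stated assumptions, not the notebook's actual code: the toy rows are made up, term is assumed to look like ' 36 months', the zip code is assumed to be the last five characters of address, earliest_cr_line is assumed to be in 'Mon-YYYY' format, and earliest_cr_year is a hypothetical column name.

```python
import pandas as pd

# Toy rows in the shape the table describes; all values are made up.
df = pd.DataFrame({
    "term": [" 36 months", " 60 months", " 36 months"],
    "grade": ["A", "B", "C"],
    "sub_grade": ["A1", "B2", "C3"],
    "home_ownership": ["RENT", "NONE", "ANY"],
    "address": ["1 Main St, Town, NY 22690",
                "2 Oak Ave, City, CA 05113",
                "3 Elm Rd, Ville, TX 22690"],
    "issue_d": ["Jan-2015", "Feb-2015", "Mar-2016"],
    "earliest_cr_line": ["Jun-1990", "Jul-2004", "Aug-2007"],
})

# term: keep only the number of months as an integer (36 or 60).
df["term"] = df["term"].str.extract(r"(\d+)", expand=False).astype(int)

# grade duplicates sub_grade, and issue_d would leak information: drop both.
df = df.drop(columns=["grade", "issue_d"])

# home_ownership: fold the rare 'NONE' and 'ANY' values into 'OTHER'.
df["home_ownership"] = df["home_ownership"].replace(["NONE", "ANY"], "OTHER")

# address: assume the zip code is the last five characters, then drop address.
df["zip_code"] = df["address"].str[-5:]
df = df.drop(columns=["address"])

# earliest_cr_line: assume 'Mon-YYYY' and keep the year as a number.
# earliest_cr_year is a hypothetical name for the new column.
df["earliest_cr_year"] = df["earliest_cr_line"].str[-4:].astype(int)
df = df.drop(columns=["earliest_cr_line"])

# Replace each categorical column with its dummy columns in one step.
df = pd.get_dummies(df, columns=["sub_grade", "home_ownership", "zip_code"])

print(df.columns.tolist())
```

Passing columns= to get_dummies gives the same result as building the dummy columns separately, concatenating them to the data frame, and dropping the originals.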