
PA v0.21

LendingClub is a peer-to-peer lending company headquartered in San Francisco. It was the first peer-to-peer lender to register securities offerings with the SEC and offer loan trading on a secondary market. The author built models using random forest and decision trees on LendingClub loan data to predict whether borrowers will repay loans based on historical data. Python, with packages such as Pandas, NumPy and Matplotlib, was used for exploratory data analysis and modeling.

Uploaded by

Sai Pawan

LendingClub is a US peer-to-peer lending company, headquartered in San Francisco, California. It was the first peer-to-peer lender to register its offerings as securities with the Securities and Exchange Commission (SEC), and to offer loan trading on a secondary market. LendingClub is the world's largest peer-to-peer lending platform.

My Goal
Given historical data on loans, including whether or not the borrower defaulted (charge-off), I have built a model that predicts whether or not a borrower will pay back their loan. This way, when the company gets a new potential customer in the future, we can assess whether or not they are likely to pay back the loan.

Models Used
1. Random Forest
2. Decision Tree
3. Neural Network

Language/Analytics Tools Used
1. Python – Jupyter Notebook

Modules used
1. Pandas
2. NumPy
3. Matplotlib
Data Set Overview: copy from
https://github.com/vishrut18/Data-Science-and-ML-Projects/blob/master/1.%20LendingClub%20Loan_Status%20Predictive%20model%20using%20Decision%20Tress%20and%20Random%20Forests.ipynb
In the form of table
27 columns
Mention it is categorical/Numerical/Ordinal in 1 column
2 files – 1st for data and 2nd for field description

Data set : Subset of All Lending Club loan data


https://www.kaggle.com/wordsforthewise/lending-club

Number of Rows and columns: 303704 and 26


EXPLORATORY DATA ANALYSIS

OVERALL GOAL: Get an understanding of which variables are important, view summary statistics, and visualize the data

As we can see, this is a heavily imbalanced problem. We have many more entries for people who fully paid off their loans than for those who did not pay back.

Ratio: XX:YY
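
The class balance can be checked with a couple of pandas calls. This is a minimal sketch on a made-up toy frame, not the real data; only the `loan_status` column name and its two values come from the slides.

```python
import pandas as pd

# Toy stand-in for the loan data; the rows are invented for illustration.
df = pd.DataFrame({
    "loan_status": ["Fully Paid"] * 8 + ["Charged Off"] * 2,
})

# Absolute counts per class.
counts = df["loan_status"].value_counts()

# Relative share of each class (sums to 1).
shares = df["loan_status"].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

On the real data the same two calls would give the counts behind the ratio above.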

The peaks at round amounts (10,000, 15,000, 20,000, etc.) indicate standard loan amounts.
EXPLORATORY DATA ANALYSIS

Checking the correlation between the continuous feature variables

We can see that 'loan_amnt' has almost perfect correlation with the 'installment' feature. Let's explore this feature further.
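
A correlation matrix of this kind can be computed as sketched below. The data is invented, and `int_rate` is included only as an illustrative third numeric column; it is an assumption, not a column named in these slides.

```python
import pandas as pd

# Toy numeric data; installments grow almost linearly with the loan amount.
df = pd.DataFrame({
    "loan_amnt": [5000, 10000, 15000, 20000, 25000],
    "installment": [160.0, 330.0, 480.0, 650.0, 800.0],
    "int_rate": [7.5, 13.2, 9.9, 16.0, 11.1],  # illustrative column
})

# Pairwise Pearson correlation between the numeric columns only.
corr = df.select_dtypes(include="number").corr()

# Correlation of the single pair discussed above.
pair = corr.loc["loan_amnt", "installment"]
print(corr.round(3))
```

A value of `pair` close to 1 suggests the two columns carry nearly the same information.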


Boxplot showing the relationship between the loan_status and the loan amount.

The loan status does not depend strongly on the loan amount, although 'Charged Off' loans have a relatively higher loan amount, which intuitively makes sense. We can also see this in the summary statistics for the loan amount, grouped by loan_status.
# Summary statistics for the loan amount, grouped by the loan_status.
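
A grouped summary like this can be produced with `groupby` and `describe`. The sketch below uses invented rows, so the numbers it prints say nothing about the real data.

```python
import pandas as pd

# Toy rows; only the column names follow the slides.
df = pd.DataFrame({
    "loan_status": ["Fully Paid", "Fully Paid", "Fully Paid",
                    "Charged Off", "Charged Off"],
    "loan_amnt": [8000, 12000, 10000, 15000, 17000],
})

# count, mean, std, min, quartiles and max of loan_amnt per loan status.
summary = df.groupby("loan_status")["loan_amnt"].describe()
print(summary)
```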

Let's explore the Grade and SubGrade columns that LendingClub attributes to the loans.

# Let's display a count plot per subgrade


To get a correlation between the numeric features and loan_status, let's first create a new column 'loan_repaid' which
contains 1 if the status is 'Fully Paid' and 0 if it's 'Charged Off'.

# Now let's create a bar plot showing these correlations
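
One possible way to build the 'loan_repaid' column and rank the numeric features by their correlation with it is sketched here. The rows are made up, and `int_rate` and `annual_inc` are illustrative feature names assumed for the example.

```python
import pandas as pd

# Toy data; the feature columns are assumptions for illustration.
df = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Fully Paid",
                    "Fully Paid", "Charged Off", "Fully Paid"],
    "int_rate": [7.0, 18.0, 9.5, 11.0, 21.0, 8.0],
    "annual_inc": [90000, 40000, 75000, 60000, 35000, 85000],
})

# 1 for 'Fully Paid', 0 for 'Charged Off'.
df["loan_repaid"] = df["loan_status"].map({"Fully Paid": 1, "Charged Off": 0})

# Correlation of every numeric feature with the target, sorted ascending.
corr_with_target = (
    df.select_dtypes(include="number")
      .corr()["loan_repaid"]
      .drop("loan_repaid")
      .sort_values()
)
print(corr_with_target)
# corr_with_target.plot(kind="bar") would draw the bar plot (needs Matplotlib).
```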


Step 2 – Cleaning the data
Make a table for each column evaluated, final action (dropped/transformed), and reason
1. Missing data analysis
Checking the title and emp_length columns

Charge-off rates are extremely similar across all employment lengths. Let's drop the emp_length column.
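
The comparison behind this decision can be sketched as a charge-off rate per employment length. The rows below are invented and deliberately give both groups the same rate; the category labels are assumed.

```python
import pandas as pd

# Toy data; the emp_length labels are assumptions.
df = pd.DataFrame({
    "emp_length": ["1 year", "1 year", "10+ years", "10+ years",
                   "10+ years", "1 year", "1 year", "10+ years"],
    "loan_status": ["Fully Paid", "Charged Off", "Fully Paid", "Fully Paid",
                    "Charged Off", "Fully Paid", "Fully Paid", "Fully Paid"],
})

# Share of charged-off loans within each employment length.
charged_off = df["loan_status"].eq("Charged Off")
charge_off_rate = charged_off.groupby(df["emp_length"]).mean()
print(charge_off_rate)

# If the rates barely differ, the column adds little and can be dropped.
df = df.drop(columns="emp_length")
```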

138448 unique values, which are subgrouped into many categories -> Removing

Looks like the title column is simply a string subcategory/description of the purpose column. So let's drop the title column.
mort_acc: 10% of values missing. Strategy?
Let's see the correlation of the mort_acc column with the other features.

Looks like the total_acc feature correlates with mort_acc, and this makes sense! So, I'll use this fillna() approach. Let's group the dataframe by total_acc and calculate the mean value of mort_acc per total_acc entry.

Let's fill in the missing mort_acc values based on their total_acc value. If the mort_acc is missing, then we fill in that missing value with the mean value corresponding to its total_acc value from the Series above.
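
The fill strategy described above can be sketched as follows, assuming a toy frame with the two columns involved. The variable name `total_acc_avg` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data with two missing mort_acc values.
df = pd.DataFrame({
    "total_acc": [10, 10, 10, 20, 20, 20],
    "mort_acc": [1.0, 3.0, np.nan, 4.0, np.nan, 6.0],
})

# Mean mort_acc for every total_acc value (NaN entries are skipped).
total_acc_avg = df.groupby("total_acc")["mort_acc"].mean()

# Look up each row's group mean and use it only where mort_acc is missing.
df["mort_acc"] = df["mort_acc"].fillna(df["total_acc"].map(total_acc_avg))
print(df)
```

Using `map` plus `fillna` keeps the operation vectorized, which matters on a few hundred thousand rows.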

revol_util and pub_rec_bankruptcies have missing data points, but they account for less than 0.5% of the total data. Let's remove the rows that are missing values in those columns.
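
A minimal sketch of this step, on invented rows: first measure how much is missing, then drop only the rows with gaps in these two columns.

```python
import numpy as np
import pandas as pd

# Toy data; one row misses revol_util, another misses pub_rec_bankruptcies.
df = pd.DataFrame({
    "revol_util": [41.8, np.nan, 53.3, 21.5],
    "pub_rec_bankruptcies": [0.0, 0.0, np.nan, 1.0],
    "loan_amnt": [10000, 8000, 15600, 7200],
})

cols = ["revol_util", "pub_rec_bankruptcies"]

# Fraction of missing values per column, to confirm the loss is small.
missing_share = df[cols].isnull().mean()
print(missing_share)

# Drop rows where either of the two columns is missing.
df = df.dropna(subset=cols)
```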
# List of all the columns that are currently non-numeric.
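
One way to obtain such a list, sketched on a tiny invented frame, is to select every column whose dtype is not numeric:

```python
import pandas as pd

# Toy frame mixing one numeric and two string columns.
df = pd.DataFrame({
    "loan_amnt": [10000.0, 8000.0],
    "term": [" 36 months", " 60 months"],
    "sub_grade": ["B4", "C1"],
})

# Names of all columns that are not numeric.
non_numeric = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```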

Table columns: Column Name, Description (Short), Features, Operation done, Reason, Example (Initial and Final)

Column Name: Term
Description (Short): The number of payments on the loan. Values are in months and can be either 36 or 60.
Features: 36 months -> 230928; 60 months -> 72147
Operation done: Let's convert the term feature into either a 36 or 60 integer numeric data type.

Column Name: 'grade'
Operation done: We already know grade is part of sub_grade, so let's just drop the grade feature.

Remaining non-numeric columns: 'sub_grade', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'purpose', 'earliest_cr_line', 'initial_list_status', 'application_type', 'address'
Decision Tree Classifier

As we saw earlier, the problem is that this dataset is highly skewed, with many more class 1 data points than class 0. With this in mind, the accuracy of this model is actually not too bad (83%). But, as I expected, this model misclassifies a lot of class 0 points (loan_status: Charged Off), with the f1-score for class 0 being 0.58. Let's see how the random forest model performs.
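
To see why accuracy and the class 0 f1-score can diverge on skewed data, here is a small worked example. The confusion-matrix counts are hypothetical and chosen only for illustration; they are not the results of the model above.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical test set: 1000 loans, 800 fully paid (class 1), 200 charged off (class 0).
# Class 0 is treated as the class of interest.
tp = 90    # charged-off loans predicted as charged off
fn = 110   # charged-off loans predicted as fully paid
fp = 20    # fully paid loans predicted as charged off
tn = 780   # fully paid loans predicted as fully paid

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision, recall, f1 = class_metrics(tp, fp, fn)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Because the majority class dominates the accuracy, a model can score well overall while still missing more than half of the charged-off loans.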
Random Forest
Table columns: Column Name, Description (Short), Features, Operation done, Reason, Example (Initial and Final)

Column Name: term
Description (Short): The number of payments on the loan. Values are in months and can be either 36 or 60.
Features: 36 months -> 230928; 60 months -> 72147
Operation done: Let's convert the term feature into either a 36 or 60 integer numeric data type.

Column Name: grade
Description (Short): LendingClub-assigned loan grade
Features: Grades are assigned as {A, B, C, D, E, F, G}
Operation done: We already know grade is part of sub_grade, so let's just drop the grade feature.
Reason: It is already available in another feature, so it can be dropped.

Column Name: sub_grade
Description (Short): LendingClub-assigned sub grade for the loan
Features: Sub grades are assigned as A1, A2, A3, A4, A5 to G5
Operation done: Convert the subgrade into dummy variables and concatenate these new columns to the original data frame.

Column Names: 1. verification_status, 2. initial_list_status, 3. application_type, 4. purpose
Description (Short):
1. Indicates if income or its source is verified by LendingClub
2. Initial listing status of the loan
3. Individual application or joint application
4. Category provided by the borrower for the loan request
Operation done: Convert these fields into dummy variables and concatenate them with the original data frame.
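
The operations in the table above could look roughly like this in pandas. This is a sketch on invented rows: the exact string format of `term` (for example " 36 months") and the use of `drop_first=True` are assumptions, not details confirmed by the slides.

```python
import pandas as pd

# Toy rows; values are made up, column names follow the table.
df = pd.DataFrame({
    "term": [" 36 months", " 60 months", " 36 months"],
    "grade": ["B", "C", "A"],
    "sub_grade": ["B4", "C1", "A2"],
    "verification_status": ["Verified", "Not Verified", "Verified"],
})

# term: pull out the digits and store them as integers (36 or 60).
df["term"] = df["term"].str.extract(r"(\d+)", expand=False).astype(int)

# grade: redundant with sub_grade, so drop it.
df = df.drop(columns="grade")

# Categorical columns: one-hot encode, then swap them into the frame.
dummy_cols = ["sub_grade", "verification_status"]
dummies = pd.get_dummies(df[dummy_cols], drop_first=True)  # drop_first is an assumption
df = pd.concat([df.drop(columns=dummy_cols), dummies], axis=1)
print(df.columns.tolist())
```

Dropping the first level of each category avoids perfectly collinear dummy columns; tree models do not need it, so it is a design choice rather than a requirement.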
Table columns: Column Name, Description (Short), Features, Operation done, Reason, Example (Initial and Final)

Column Name: home_ownership
Description (Short): The home ownership status provided by the borrower during registration or obtained from the credit report
Features: Values are 'MORTGAGE', 'RENT', 'OWN', 'NONE', 'ANY', 'OTHER'
Operation done: Replace 'NONE' and 'ANY' with 'OTHER'. Convert the result to dummy variables and concatenate them with the original data frame.
Reason: This reduces the number of categories to 4.

Column Name: address
Description (Short): State provided by the borrower in the loan application
Features: Contains the complete address, including the zip code
Operation done: Extract the zip code into a zip_code column, convert zip_code into dummy variables, concatenate the result, and drop both the original zip_code column and the address column.

Column Name: issue_d
Description (Short): The month in which the loan was funded
Operation done: Drop this column. It would be data leakage, because when using our model we would not know beforehand whether or not a loan was issued.
Reason: The model will not know beforehand whether the loan is issued or not.

Column Name: earliest_cr_line
Description (Short): The month the borrower's earliest reported credit line was opened
Operation done: Extract the year from this feature and convert it into a numeric feature.
Reason: It is a historic time stamp feature and need not be converted into dummy variables, as the year can be treated as a continuous data type.

Column Name: loan_status
Description (Short): Current status of the loan
Operation done: Drop the loan_status column, since it duplicates the loan_repaid column.
Reason: The loan_repaid column already holds the values as 0 and 1, so we will use only that.
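
A possible pandas version of the home_ownership, address and earliest_cr_line steps is sketched below. The rows are invented, and two format assumptions are made: that each address ends in a five-digit zip code and that earliest_cr_line ends in a four-digit year. The column name `earliest_cr_year` is hypothetical.

```python
import pandas as pd

# Toy rows; address and date formats are assumptions.
df = pd.DataFrame({
    "home_ownership": ["RENT", "NONE", "MORTGAGE", "ANY"],
    "address": [
        "1 Example St\nSpringfield, IL 22690",
        "2 Sample Ave\nRiverton, TX 05113",
        "3 Demo Rd\nLakeside, CA 22690",
        "4 Test Blvd\nHilltop, NY 05113",
    ],
    "earliest_cr_line": ["Jun-1990", "Jul-2004", "Aug-2007", "Sep-2006"],
})

# Merge the rare categories into 'OTHER'.
df["home_ownership"] = df["home_ownership"].replace(["NONE", "ANY"], "OTHER")

# Last five characters of the address are taken as the zip code.
df["zip_code"] = df["address"].str[-5:]

# Last four characters of the date string are taken as the year.
df["earliest_cr_year"] = df["earliest_cr_line"].str[-4:].astype(int)

# One-hot encode the two categorical columns and drop the originals.
dummies = pd.get_dummies(df[["home_ownership", "zip_code"]])
df = pd.concat(
    [df.drop(columns=["home_ownership", "zip_code", "address", "earliest_cr_line"]),
     dummies],
    axis=1,
)
print(df.columns.tolist())
```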
