PA v0.21

LendingClub is a peer-to-peer lending company headquartered in San Francisco. It was the first peer-to-peer lender to register securities offerings with the SEC and to offer loan trading on a secondary market. The author built models using random forests and decision trees on LendingClub loan data to predict, from historical data, whether borrowers will repay their loans. Python, with packages such as Pandas, NumPy, and Matplotlib, was used for exploratory data analysis and modeling.

Uploaded by

Sai Pawan


LendingClub is a US peer-to-peer lending company, headquartered in San Francisco, California. It was the first peer-to-peer lender to register its offerings as securities with the Securities and Exchange Commission (SEC), and to offer loan trading on a secondary market. LendingClub is the world's largest peer-to-peer lending platform.

My Goal
Given historical data on loans given out, with information on whether or not the borrower defaulted (charge-off), I have built a model that predicts whether or not a borrower will pay back their loan. This way, when the company gets a new potential customer in the future, we can assess whether or not they are likely to pay back the loan.

Models Used
1. Random Forest
2. Decision Tree
3. Neural Network

Language/Analytics Tools Used
1. Python – Jupyter Notebook

Modules Used
1. Pandas
2. NumPy
3. Matplotlib
Data Set Overview: copy from
https://github.com/vishrut18/Data-Science-and-ML-Projects/blob/master/1.%20LendingClub%20Loan_Status%20Predictive%20model%20using%20Decision%20Tress%20and%20Random%20Forests.ipynb
To be presented in the form of a table:
27 columns
One column should state whether each field is categorical, numerical, or ordinal
2 files: the first for the data and the second for the field descriptions

Data set: Subset of All Lending Club loan data
https://www.kaggle.com/wordsforthewise/lending-club

Number of rows and columns: 303704 and 26

EXPLORATORY DATA ANALYSIS

OVERALL GOAL: Get an understanding of which variables are important, view summary statistics, and visualize the data.

As we can see, this is a really imbalanced problem. We have many more entries for people who fully paid off their loans than for those who did not pay back.

Ratio: XX:YY
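The class balance described above could be checked with something like the following. This is only a minimal sketch on a made-up toy frame (the counts are not from the LendingClub data); only the column name loan_status and its two values come from the text.

```python
import pandas as pd

# Hypothetical toy data; the real notebook would load the LendingClub CSV instead.
df = pd.DataFrame({
    "loan_status": ["Fully Paid"] * 8 + ["Charged Off"] * 2,
})

# Absolute counts and relative shares of each class
counts = df["loan_status"].value_counts()
shares = df["loan_status"].value_counts(normalize=True)

# Ratio of repaid loans to charged-off loans
ratio = counts["Fully Paid"] / counts["Charged Off"]

print(counts)
print(shares.round(2))
print(f"Fully Paid : Charged Off = {ratio:.1f} : 1")
```

On the real data, the same two value_counts() calls would give the numbers needed for the ratio.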

The peaks at 10,000, 15,000, 20,000, etc. indicate standard loan amounts.
Checking the correlation between the continuous feature variables

We can see that 'loan_amnt' has an almost perfect correlation with the 'installment' feature. Let's explore this feature further.
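A correlation check like the one described could look roughly like this. It is a sketch on invented numbers; the column int_rate is added only for contrast and is an assumption, not something taken from the slides.

```python
import pandas as pd

# Hypothetical toy values, chosen so that installment grows with loan_amnt
df = pd.DataFrame({
    "loan_amnt":   [5000, 10000, 15000, 20000, 25000],
    "installment": [160.0, 330.0, 480.0, 650.0, 800.0],
    "int_rate":    [13.5, 7.9, 11.2, 16.0, 9.4],
})

# Pairwise Pearson correlation between the continuous features
corr = df.corr()
print(corr.round(3))

# The single pair of interest
print(corr.loc["loan_amnt", "installment"])
```

A near-perfect correlation like this suggests the two features carry almost the same information, which is why the pair deserves a closer look.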


Boxplot showing the relationship between loan_status and the loan amount.

The loan status does not depend much on the loan amount, although 'Charged Off' loans have a somewhat higher loan amount, which intuitively makes sense. We can also see this in the summary statistics for the loan amount, grouped by loan_status.
# Summary statistics for the loan amount, grouped by the loan_status.
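The grouped summary statistics mentioned above can be produced with groupby and describe. A small sketch on invented amounts (not the real figures):

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "loan_status": ["Fully Paid", "Fully Paid", "Fully Paid", "Fully Paid",
                    "Charged Off", "Charged Off", "Charged Off"],
    "loan_amnt":   [5000, 10000, 15000, 20000, 12000, 18000, 24000],
})

# count, mean, std, min, quartiles and max of loan_amnt per loan_status
summary = df.groupby("loan_status")["loan_amnt"].describe()
print(summary)
```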

Let's explore the Grade and SubGrade columns that LendingClub attributes to the loans.

# Let's display a count plot per sub_grade

To get the correlation between the numeric features and loan_status, let's first create a new column 'loan_repaid', which contains 1 if the status is 'Fully Paid' and 0 if it's 'Charged Off'.

# Now let's create a bar plot showing these correlations
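One way to build the loan_repaid column and the correlations behind the bar plot is sketched below. The feature columns int_rate and annual_inc and all values are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Fully Paid",
                    "Fully Paid", "Charged Off", "Fully Paid"],
    "int_rate":    [7.5, 18.2, 9.1, 11.0, 21.4, 8.3],
    "annual_inc":  [85000, 40000, 72000, 65000, 38000, 90000],
})

# 1 if the loan was fully paid, 0 if it was charged off
df["loan_repaid"] = df["loan_status"].map({"Fully Paid": 1, "Charged Off": 0})

# Correlation of every numeric feature with the target, sorted ascending
corr_with_target = (
    df.select_dtypes(include="number")
      .corr()["loan_repaid"]
      .drop("loan_repaid")
      .sort_values()
)
print(corr_with_target)
```

Calling corr_with_target.plot(kind="bar") would then draw the bar plot (this needs Matplotlib installed).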

Step 2 – Cleaning the data
Make a table for each column evaluated, with the final action (dropped/transformed) and the reason.
1. Missing data analysis
Checking the title and emp_length columns

Charge-off rates are extremely similar across all employment lengths. Let's drop the emp_length column.
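The charge-off rate per employment length could be computed roughly as follows. This is a sketch on made-up rows; the two emp_length labels are only examples.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "emp_length":  ["1 year"] * 4 + ["10+ years"] * 4,
    "loan_status": ["Fully Paid", "Fully Paid", "Charged Off", "Fully Paid",
                    "Fully Paid", "Charged Off", "Fully Paid", "Fully Paid"],
})

# Share of charged-off loans within each employment-length group
charge_off_rate = (df["loan_status"] == "Charged Off").groupby(df["emp_length"]).mean()
print(charge_off_rate)

# If the rates are nearly identical, the feature adds little and can be dropped
df = df.drop(columns="emp_length")
```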

138448 unique values, which are subgrouped in many categories -> removing.

It looks like the title column is simply a string subcategory/description of the purpose column, so let's drop the title column.
mort_acc: 10% of the values are missing. Strategy?
Let's see the correlation of the mort_acc column with the other features.

It looks like the total_acc feature correlates with mort_acc, and this makes sense! So I'll use a fillna() approach. Let's group the dataframe by total_acc and calculate the mean value of mort_acc per total_acc entry.

Let's fill in the missing mort_acc values based on their total_acc value. If mort_acc is missing, then we fill in that missing value with the mean value corresponding to its total_acc value from the Series above.
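The fill strategy described above can be sketched like this, on a tiny invented frame. The variable name total_acc_avg is a hypothetical helper name, not taken from the slides.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing mort_acc values
df = pd.DataFrame({
    "total_acc": [10, 10, 10, 20, 20],
    "mort_acc":  [1.0, 3.0, np.nan, 4.0, np.nan],
})

# Mean mort_acc for every total_acc value (NaN entries are ignored by mean)
total_acc_avg = df.groupby("total_acc")["mort_acc"].mean()

# Fill each missing mort_acc with the mean that belongs to its total_acc value
df["mort_acc"] = df["mort_acc"].fillna(df["total_acc"].map(total_acc_avg))
print(df)
```

Rows that already have a mort_acc value are left untouched; only the gaps are filled.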

revol_util and pub_rec_bankruptcies have missing data points, but they account for less than 0.5% of the total data. Let's remove the rows that are missing values in those columns.
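Dropping those rows might look like the sketch below. The values are invented; in this toy frame the missing share is of course far larger than 0.5%.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "revol_util":           [41.8, np.nan, 53.3, 21.5, 69.8],
    "pub_rec_bankruptcies": [0.0, 0.0, np.nan, 1.0, 0.0],
    "loan_amnt":            [10000, 8000, 15600, 7200, 24375],
})

cols = ["revol_util", "pub_rec_bankruptcies"]

# Share of missing values per column, to confirm it is small before dropping
missing_share = df[cols].isna().mean()
print(missing_share)

# Keep only the rows that have values in both columns
df = df.dropna(subset=cols)
print(len(df))
```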
# List of all the columns that are currently non-numeric.

| Column Name | Description (Short) | Features | Operation done | Reason | Example (Initial and Final) |
|---|---|---|---|---|---|
| term | The number of payments on the loan. Values are in months and can be either 36 or 60. | 36 months: 230928; 60 months: 72147 | Convert the term feature into an integer numeric data type (36 or 60). | | |
| grade | | | We already know grade is part of sub_grade, so just drop the grade feature. | | |

Remaining non-numeric columns: 'sub_grade', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'purpose', 'earliest_cr_line', 'initial_list_status', 'application_type', 'address'
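A list of the non-numeric columns like the one above could be obtained along these lines. A sketch on a made-up four-column frame:

```python
import pandas as pd

# Hypothetical toy frame; the real data has many more columns
df = pd.DataFrame({
    "loan_amnt": [10000, 15000],
    "term":      ["36 months", "60 months"],
    "grade":     ["A", "C"],
    "int_rate":  [7.9, 13.5],
})

# Every column whose dtype is not numeric
non_numeric_cols = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```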
Decision Tree Classifier

As we saw earlier, the problem is that this dataset is highly skewed, with many more class 1 data points than class 0. With this in mind, the accuracy of this model (83%) is actually not too bad. But, as I expected, the model misclassifies many class 0 points (loan_status: Charged Off); the f1-score for class 0 is 0.58. Let's see how the random forest model performs.
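To see why a decent accuracy can sit next to a weak class 0 f1-score on skewed data, here is a small worked calculation. The confusion-matrix counts are invented for illustration and are not the model's actual results.

```python
# Hypothetical confusion-matrix counts (class 0 = Charged Off, class 1 = Fully Paid)
true0_pred0 = 60    # charged-off loans correctly flagged
true0_pred1 = 40    # charged-off loans missed by the model
true1_pred0 = 50    # repaid loans wrongly flagged as charged off
true1_pred1 = 350   # repaid loans correctly predicted

total = true0_pred0 + true0_pred1 + true1_pred0 + true1_pred1
accuracy = (true0_pred0 + true1_pred1) / total


def f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)


f1_class0 = f1(tp=true0_pred0, fp=true1_pred0, fn=true0_pred1)
f1_class1 = f1(tp=true1_pred1, fp=true0_pred1, fn=true1_pred0)

print(f"accuracy   = {accuracy:.2f}")
print(f"f1 class 0 = {f1_class0:.2f}")
print(f"f1 class 1 = {f1_class1:.2f}")
```

Because class 1 dominates, its correct predictions carry the accuracy, while the per-class f1-score exposes how often the minority class is missed.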
Random Forest
| Column Name | Description (Short) | Features | Operation done | Reason | Example (Initial and Final) |
|---|---|---|---|---|---|
| term | The number of payments on the loan. Values are in months and can be either 36 or 60. | 36 months: 230928; 60 months: 72147 | Convert the term feature into an integer numeric data type (36 or 60). | | |
| grade | Lending Club assigned loan grade | Grades are assigned as {A, B, C, D, E, F, G} | We already know grade is part of sub_grade, so just drop the grade feature. | It is already available in another feature, so it can be dropped. | |
| sub_grade | Lending Club assigned sub grade for the loan | Sub grades are assigned as A1, A2, A3, A4, A5 to G5 | Convert the sub grade into dummy variables and concatenate these new columns to the original data frame. | | |
| verification_status | Indicates if income or its source is verified by Lending Club | | Convert into dummy variables and concatenate with the original data frame. | | |
| initial_list_status | Initial listing status of the loan | | Convert into dummy variables and concatenate with the original data frame. | | |
| application_type | Individual application or joint application | | Convert into dummy variables and concatenate with the original data frame. | | |
| purpose | Category provided by the borrower for the loan request | | Convert into dummy variables and concatenate with the original data frame. | | |
| home_ownership | The home ownership status provided by the borrower during registration or obtained from the credit report | Values are 'MORTGAGE', 'RENT', 'OWN', 'NONE', 'ANY', 'OTHER' | Replace 'NONE' and 'ANY' with 'OTHER'. Convert to dummy variables and concatenate them with the original data frame. | This reduces the number of categories to 4. | |
| address | State provided by the borrower in the loan application | Contains the complete address, including the zip code | Extract the zip code into a zip_code column, convert it into dummy variables, concatenate the result, and drop both the original zip_code column and the address column. | | |
| issue_d | The month in which the loan was funded | | Drop this column. Keeping it would be data leakage, because when using the model we won't know beforehand whether the loan was issued. | The model will not know beforehand whether the loan is issued or not. | |
| earliest_cr_line | The month the borrower's earliest reported credit line was opened | | Extract the year from this feature and convert it into a numeric feature. | It is a historical timestamp feature and need not be converted into dummy variables, as the year can be treated as a continuous data type. | |
| loan_status | Current status of the loan | | Drop the loan_status column, since it duplicates the loan_repaid column. | The loan_repaid column already holds the values as 0 and 1, so we use only that. | |
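Several of the operations in the table can be sketched together as below. This is only an illustration under assumptions: the toy rows, the address layout (zip code as the last five characters), the "Mon-YYYY" date format, and the new column name earliest_cr_year are all hypothetical.

```python
import pandas as pd

# Hypothetical toy rows; formats are assumed, not confirmed by the slides
df = pd.DataFrame({
    "term":             [" 36 months", " 60 months", " 36 months"],
    "sub_grade":        ["A1", "B2", "A1"],
    "home_ownership":   ["RENT", "NONE", "ANY"],
    "address":          ["1 Main St\nSpringfield, IL 62701",
                         "2 Oak Ave\nAustin, TX 73301",
                         "3 Elm Rd\nSalem, OR 97301"],
    "earliest_cr_line": ["Jun-1990", "Jan-2004", "Oct-1999"],
})

# term: keep only the number of months as an integer (36 or 60)
df["term"] = df["term"].str.extract(r"(\d+)", expand=False).astype(int)

# home_ownership: merge the rare 'NONE' and 'ANY' values into 'OTHER'
df["home_ownership"] = df["home_ownership"].replace(["NONE", "ANY"], "OTHER")

# address: keep only the zip code (assumed to be the last five characters)
df["zip_code"] = df["address"].str[-5:]

# earliest_cr_line: keep the year as a numeric feature
df["earliest_cr_year"] = df["earliest_cr_line"].str[-4:].astype(int)

# Dummy-encode the categorical columns and concatenate them to the frame
cat_cols = ["sub_grade", "home_ownership", "zip_code"]
dummies = pd.get_dummies(df[cat_cols])
df = pd.concat(
    [df.drop(columns=cat_cols + ["address", "earliest_cr_line"]), dummies],
    axis=1,
)
print(df.columns.tolist())
```

Passing drop_first=True to pd.get_dummies is a common variation that removes one redundant dummy per feature; the slides do not say whether it was used.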
