Credit Modelling in R
library(DescTools)
library(skimr)
library(plyr)
library(dplyr)
library(aod)
library(readxl)
# Visualization packages
library(Deducer)
library(ggplot2)
library(plotly)  # provides ggplotly(), which is called in the visualization code
library(ROCR)
library(pROC)
library(caret)
library(MASS)
Now it is time to load the dataset and do some data management. We will work with
the Lending Club loan dataset. The code below handles the data management:
# Import dataset
# Keep only the columns used in the analysis
loan_data <- loan_data[, c("grade", "sub_grade", "term", "loan_amnt", "issue_d",
                           "loan_status", "emp_length", "home_ownership",
                           "annual_inc", "verification_status", "purpose", "dti",
                           "delinq_2yrs", "addr_state", "int_rate",
                           "inq_last_6mths", "mths_since_last_delinq",
                           "mths_since_last_record", "open_acc", "pub_rec",
                           "revol_bal", "revol_util", "total_acc")]
# Replace missing values with 0
loan_data$mths_since_last_delinq[is.na(loan_data$mths_since_last_delinq)] <- 0
loan_data$mths_since_last_record[is.na(loan_data$mths_since_last_record)] <- 0
This is the second step of a credit modeling analysis; the visualization step is
covered in my previous article. For readers who only read this article, here is a
minimum of visualization:
# Visualization of the data
col="black", aes(fill=..count..)) +
loanamount_barchart
ggplotly(p = ggplot2::last_plot())
theme(axis.text.x = element_blank()) +
ggplotly(p = ggplot2::last_plot())
The code above produces the following two visualizations:
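As a rough, self-contained sketch of how a loan amount histogram of this kind could be built, the block below uses synthetic data. The data frame `toy_loans`, its values, the bin width and the labels are all assumptions made for illustration; none of them come from the Lending Club data.

```r
library(ggplot2)

# Synthetic stand-in for loan_data$loan_amnt (hypothetical values, not the real data)
set.seed(1)
toy_loans <- data.frame(loan_amnt = round(runif(500, min = 1000, max = 35000)))

# Histogram of loan amounts; the bar fill follows the count in each bin
loanamount_hist <- ggplot(toy_loans, aes(x = loan_amnt)) +
  geom_histogram(binwidth = 1000, col = "black", aes(fill = after_stat(count))) +
  labs(title = "Loan amount", x = "Amount", y = "Number of loans")

print(loanamount_hist)
```

Here `after_stat(count)` is the current ggplot2 spelling of the older `..count..` notation. Passing the plot to `plotly::ggplotly()` would make it interactive, if plotly is installed.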
n obs: 886877
n variables: 23
Variable types: factor, numeric
loan_data=as.data.frame(loan_data[loan_data$loan_status!="Current", ])
# Annual_inc
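# Home_ownership
# Convert to character first: assigning 1 to a factor would produce NA
loan_data$home_ownership <- as.character(loan_data$home_ownership)
loan_data$home_ownership[loan_data$home_ownership=="OWN" |
                         loan_data$home_ownership=="MORTGAGE"] <- 1
loan_data$home_ownership[loan_data$home_ownership!=1] <- 0
loan_data$home_ownership <- as.numeric(loan_data$home_ownership)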
# Delinq_2yrs
loan_data$delinq_2yrs[loan_data$delinq_2yrs=="0"] <- 0
loan_data$delinq_2yrs[loan_data$delinq_2yrs!= 0] <- 1
# Verification_status
loan_data$verification_status = as.character(loan_data$verification_status)
loan_data$verification_status[loan_data$verification_status == "Verified" |
                              loan_data$verification_status == "Source Verified"] = 1
loan_data$verification_status[loan_data$verification_status != 1] = 0
loan_data$verification_status = as.numeric(loan_data$verification_status)
# Dti
loan_data$dti[loan_data$dti!=1] <- 0
# Status
loan_data$loan_status[loan_data$loan_status != 1] <- 0
table(loan_data$loan_status)
PercTable(loan_data$loan_status)
# Grouping variables
loan_data$purpose == "home_improvement"|
loan_data$purpose == "credit_card" ] <- 2
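The recoding pattern used in the steps above (overwrite the matching levels with 1, everything else with 0) can also be written in one line per variable with `%in%`. The block below is a minimal sketch on a toy data frame; the rows in `toy` are made up for illustration, and base R `table()` and `prop.table()` stand in for `PercTable()` when checking the class balance.

```r
# Toy data frame standing in for loan_data (hypothetical rows, not the real data)
toy <- data.frame(
  home_ownership      = factor(c("OWN", "MORTGAGE", "RENT", "OTHER")),
  verification_status = factor(c("Verified", "Source Verified",
                                 "Not Verified", "Verified")),
  delinq_2yrs         = c(0, 2, 0, 1)
)

# 1 if the borrower owns the home or has a mortgage, 0 otherwise
toy$home_ownership <- as.numeric(toy$home_ownership %in% c("OWN", "MORTGAGE"))

# 1 if income was verified in either way, 0 otherwise
toy$verification_status <- as.numeric(
  toy$verification_status %in% c("Verified", "Source Verified")
)

# 1 if there was at least one delinquency in the last two years, 0 otherwise
toy$delinq_2yrs <- as.numeric(toy$delinq_2yrs != 0)

print(toy)

# Counts and shares of each class for one of the recoded variables
print(table(toy$home_ownership))
print(prop.table(table(toy$home_ownership)))
```

Because `%in%` compares factor levels by their labels, this version works whether the column is a factor or a character vector and never creates NA values.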
Now it is time for the regression analysis. We will work with multiple logistic
regression, which is applied when the variable to explain (y) is binary. The model
passes a linear combination of the explanatory variables (the x’s) through the logistic
function, which is the cumulative distribution function of the logistic distribution,
to estimate the probability that y equals 1. We will use a stepwise procedure to arrive
at a final model. The code below generates the multiple logistic regression analysis:
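Before the full model on the loan data, the mechanics can be illustrated on synthetic data. The block below is a minimal sketch, not the model from this article: the variables `x1`, `x2` and `noise`, the coefficients, and the sample size are all assumptions, and base R `step()` is used for the stepwise search (`MASS::stepAIC()` is a similar alternative).

```r
set.seed(42)
n <- 500

# Synthetic predictors (hypothetical): two informative, one unrelated to the outcome
toy <- data.frame(
  x1    = rnorm(n),
  x2    = rnorm(n),
  noise = rnorm(n)
)

# True model: the log-odds of y = 1 are linear in x1 and x2;
# plogis() is the logistic CDF, which maps log-odds to probabilities
p <- plogis(-0.5 + 1.2 * toy$x1 - 0.8 * toy$x2)
toy$y <- rbinom(n, size = 1, prob = p)

# Full logistic regression with all candidate predictors
full_model <- glm(y ~ x1 + x2 + noise, data = toy,
                  family = binomial(link = "logit"))

# Stepwise selection by AIC, adding and dropping terms in both directions
step_model <- step(full_model, direction = "both", trace = 0)
print(summary(step_model))

# Fitted probabilities that y = 1 for each observation
pred <- predict(step_model, type = "response")
print(head(pred))
```

Stepwise selection by AIC is convenient, but the selected model depends on the sample, so its coefficients are best checked on held-out data.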