This project report focuses on predictive modeling for loan risk management and interest rate optimization in financial lending. It outlines the development of classification and regression models using a dataset of 32,581 entries to predict loan defaults and estimate interest rates based on borrower characteristics. The report details the software requirements, data preprocessing, model training, and evaluation results, demonstrating the effectiveness of the models in enhancing credit risk management.


PREDICTIVE ANALYTICS PROJECT REPORT

PREDICTIVE MODELING FOR LOAN RISK MANAGEMENT AND
INTEREST RATE OPTIMIZATION IN FINANCIAL LENDING
Bachelor of Technology

(COMPUTER SCIENCE ENGINEERING)

Submitted By

ANKIT KUMAR (Registration No. 12113667) (Roll No. 59)

Under the Supervision of

ANKIT KUMAR

LOVELY PROFESSIONAL UNIVERSITY

PUNJAB

NOVEMBER 2024
INDEX

S. No. Topic

1 Declaration

2 Software requirement analysis

3 Introduction

4 Dataset

5 Code Implementation and outputs


DECLARATION

I hereby declare that the project work entitled “PREDICTIVE MODELING FOR LOAN
RISK MANAGEMENT AND INTEREST RATE OPTIMIZATION IN FINANCIAL
LENDING” is an authentic record of my own work, carried out as a requirement of the predictive
analytics project for the award of the degree of Bachelor of Technology (COMPUTER
SCIENCE ENGINEERING) from LOVELY PROFESSIONAL UNIVERSITY, PUNJAB,
under the guidance of Ankit Thakur during October and November 2024.

Ankit Kumar

(registration no.12113667)

Date: 10th November 2024

This is to certify that the above statement made by the student is correct to the best
of my knowledge and belief.

(Tanima Thakur, Assistant Professor)


SOFTWARE REQUIREMENT ANALYSIS

To conduct effective predictive modeling for loan risk management and interest rate estimation,
this project requires a robust software environment that supports data preprocessing, statistical
analysis, and machine learning algorithms. The following software tools and packages are
recommended to ensure seamless data handling, model building, and result visualization.
1. Operating System
• Windows: The analysis is not tied to a particular platform and can be run on any
mainstream operating system that supports R.
2. Software Environment
• R Programming Language: R is chosen for its extensive libraries for statistical
analysis, machine learning, and data visualization. It is well suited to data science
projects that require heavy data manipulation and rapid prototyping.
• RStudio: An integrated development environment (IDE) for R that provides a
user-friendly interface, project management tools, and support for visualization and reporting.
3. Packages and Libraries
• tidyverse: A collection of packages for data manipulation, cleaning, and visualization,
including dplyr and ggplot2.
• caret: For data partitioning, model training, cross-validation, and evaluation metrics.
caret simplifies the process of training and tuning multiple models.
• randomForest: For building both the classification and the regression model. Random Forest
is chosen for its robustness and its ability to handle non-linear relationships.
• ggplot2: For creating visualizations, such as feature importance plots, to interpret model
results. It is part of tidyverse and is listed separately because the plots depend on it.
4. Hardware Requirements
• Memory (RAM): 8 GB, for handling large datasets efficiently.
• Processor: A multi-core processor (for example, Intel i5), to allow faster data processing
and model training.
• Storage: 10 GB of free disk space to accommodate datasets, environment dependencies,
and generated results.
5. Data Storage and Management
• CSV File Support: The dataset is provided as a CSV file, which R can readily handle.
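A minimal sketch of how the package requirements listed above could be checked before running the analysis. The helper name `missing_packages` is hypothetical and not part of the project code; it only reports which packages are not yet installed and installs nothing by itself.

```r
# Hypothetical helper: return the names in `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

# Packages named in the requirements list (ggplot2 ships with tidyverse)
required <- c("tidyverse", "caret", "randomForest")
to_install <- missing_packages(required)

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
} else {
  message("All required packages are installed.")
}
```
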
INTRODUCTION

In today’s financial landscape, effective risk management is crucial for institutions providing
credit services. Lenders face a persistent challenge in balancing profitability with risk, as loan
defaults can significantly impact financial stability. By leveraging data analytics and predictive
modeling, financial institutions can gain insights into borrower risk profiles and make informed
decisions that reduce the probability of defaults and align interest rates with individual borrower
risk.
This study aims to assist a financial institution in addressing two critical objectives within its
lending process:
1. Predicting Loan Default: Using historical loan and borrower data, a classification model
is developed to predict whether a borrower is likely to default on a loan. This model
incorporates key borrower demographics, financial status, and credit history, helping the
institution identify high-risk applicants and take proactive measures to mitigate loan
losses.
2. Estimating Loan Interest Rates: To create a customized, risk-adjusted pricing strategy,
we also develop a regression model that estimates the appropriate interest rate for a given
borrower. By predicting interest rates based on borrower characteristics and loan details,
the institution can optimize loan pricing, offering competitive rates that reflect individual
risk levels.
By implementing predictive models for loan default and interest rate estimation, this analysis
offers a data-driven approach for improving credit risk management and interest rate strategies.
The results of this study aim to support lenders in making more accurate, reliable, and
personalized lending decisions, enhancing both operational efficiency and customer satisfaction.
DATASET
The dataset provides a range of borrower and loan-related information, which enables us to
predict loan default risks and estimate interest rates. It consists of 32,581 entries with 12
columns (including the two target variables), covering borrower demographics, loan attributes,
and credit history.
Key Columns
1. person_age: Integer - The age of the borrower. Age can be an indicator of financial
maturity, stability, and loan repayment behavior.
2. person_income: Integer - The annual income of the borrower in dollars. This is an
essential predictor, as income levels affect a borrower’s ability to meet repayment
obligations.
3. person_home_ownership: Categorical - The type of home ownership, which includes
categories like "RENT," "OWN," and "MORTGAGE." Home ownership status can
reflect financial stability and assets, influencing creditworthiness.
4. person_emp_length: Float - Employment length in years. Longer employment histories
may indicate job stability, often associated with reduced default risk. Missing values in
this column are imputed with the median value.
5. loan_intent: Categorical - The purpose of the loan, which includes categories such as
"PERSONAL," "EDUCATION," "MEDICAL," "VENTURE," etc. Different loan
purposes might correlate with varying default risks.
6. loan_grade: Categorical - The loan grade assigned by the lender, ranging from A to G.
Loan grades often reflect the borrower’s creditworthiness, with grades closer to G generally
indicating higher risk.
7. loan_amnt: Integer - The amount of the loan requested by the borrower. Higher loan
amounts may carry greater risk, particularly if the borrower has limited income or a
shorter credit history.
8. loan_int_rate: Float - The interest rate on the loan. This is a critical target variable for
the regression model, as interest rates are determined based on a combination of borrower
characteristics and risk factors. Missing values in this column are filled with the median
interest rate.
9. loan_status: Binary (0 or 1) - The target variable for the classification model. A value of
1 indicates that the borrower defaulted on the loan, while 0 indicates that the loan was paid
off. This is the primary variable of interest for assessing default risk.
10. loan_percent_income: Float - The ratio of the loan amount to the borrower's income.
This measure indicates the extent of the financial burden placed on the borrower by the
loan and may be a predictor of default risk.
11. cb_person_default_on_file: Categorical (Y/N) - Indicates whether the borrower has any
prior default record. This is a significant predictor, as borrowers with past defaults may
be at higher risk for future defaults.
12. cb_person_cred_hist_length: Integer - The borrower’s credit history length in years. A
longer credit history is often positively associated with creditworthiness and lower
default risk.
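The column descriptions above can be checked against the loaded data with a few base R calls. The sketch below uses a small made-up data frame named `toy`; its values are invented for illustration and are not taken from the dataset. It counts missing values per column, compares loan_percent_income with loan_amnt divided by person_income, and shows the share of each loan_status class.

```r
# Toy rows with the same column names as the dataset; all values are invented.
toy <- data.frame(
  person_income       = c(50000, 80000, 30000),
  person_emp_length   = c(3, NA, 10),
  loan_amnt           = c(5000, 20000, 6000),
  loan_int_rate       = c(11.5, NA, 7.9),
  loan_percent_income = c(0.10, 0.25, 0.20),
  loan_status         = c(0, 1, 0)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Largest gap between the stored ratio and loan_amnt / person_income
ratio_gap <- abs(toy$loan_percent_income - toy$loan_amnt / toy$person_income)
print(max(ratio_gap))

# Share of loans in each class of the classification target
print(prop.table(table(toy$loan_status)))
```

On the real data, the same calls would be applied to the data frame returned by read.csv() instead of `toy`.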
Dataset Usage for Problem Solving
• Classification Problem: The loan_status column is used as the target variable to predict
whether a borrower will default on a loan. Features like person_age, person_income,
loan_grade, and cb_person_default_on_file are key predictors.
• Regression Problem: The loan_int_rate column serves as the target variable for
estimating loan interest rates. Borrower characteristics such as person_income,
loan_grade, loan_percent_income, and cb_person_cred_hist_length contribute to the
model’s ability to predict an appropriate interest rate for each loan.
CODE DESIGN IMPLEMENTATION OF ANALYTICS
# Load required libraries
library(tidyverse) # For data manipulation
library(caret) # For model training and evaluation
library(randomForest) # For random forest models
library(ggplot2) # For visualization

# Load the dataset (file.choose() opens a file picker for selecting the CSV file)
credit_data <- read.csv(file.choose())

# Data Preprocessing
# Handle missing values by filling with median values
credit_data$person_emp_length[is.na(credit_data$person_emp_length)] <-
  median(credit_data$person_emp_length, na.rm = TRUE)
credit_data$loan_int_rate[is.na(credit_data$loan_int_rate)] <-
  median(credit_data$loan_int_rate, na.rm = TRUE)
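The median imputation step can also be written as a small reusable function. This is a sketch only: the helper name `impute_median` and its optional `fill` argument are hypothetical and do not appear in the project code, and the example vector is invented.

```r
# Hypothetical helper: replace NA entries with the median of the observed
# values, or with a caller-supplied value if `fill` is given.
impute_median <- function(x, fill = NULL) {
  if (is.null(fill)) fill <- median(x, na.rm = TRUE)
  x[is.na(x)] <- fill
  x
}

emp_length <- c(2, NA, 5, 7, NA)  # invented employment lengths in years
imputed <- impute_median(emp_length)
print(imputed)
```

The `fill` argument is there so that a median computed on one set of rows (for example, the training rows only) could be reused on another set. The project code itself computes the median on the full dataset before splitting.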

# Convert categorical columns to factors
credit_data$person_home_ownership <- as.factor(credit_data$person_home_ownership)
credit_data$loan_intent <- as.factor(credit_data$loan_intent)
credit_data$loan_grade <- as.factor(credit_data$loan_grade)
credit_data$cb_person_default_on_file <- as.factor(credit_data$cb_person_default_on_file)
credit_data$loan_status <- as.factor(credit_data$loan_status) # Target variable for classification

# Classification Task: Predicting Loan Default
# Split data into training (80%) and testing (20%) sets
set.seed(123)
trainIndex <- createDataPartition(credit_data$loan_status, p = 0.8, list = FALSE)
trainData <- credit_data[trainIndex, ]
testData <- credit_data[-trainIndex, ]
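For a factor outcome, createDataPartition() samples within each class so that the training and testing sets keep a similar class balance. The base R sketch below illustrates that idea on invented labels. The function `stratified_split` is hypothetical, and caret's own implementation may differ in details such as rounding.

```r
# Hypothetical helper: sample a share `p` of the row indices within each class.
stratified_split <- function(y, p = 0.8) {
  idx_by_class <- split(seq_along(y), y)
  train_idx <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(train_idx, use.names = FALSE))
}

set.seed(123)
status <- factor(rep(c(0, 1), times = c(80, 20)))  # invented labels
train_idx <- stratified_split(status, p = 0.8)

# Class shares in the training and testing portions
print(prop.table(table(status[train_idx])))
print(prop.table(table(status[-train_idx])))
```
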

# Train a Random Forest Classifier
# loan_int_rate is not used as a predictor here; it is the target of the regression task
rf_model <- randomForest(
  loan_status ~ person_age + person_income + person_home_ownership +
    person_emp_length + loan_intent + loan_grade + loan_amnt +
    loan_percent_income + cb_person_default_on_file +
    cb_person_cred_hist_length,
  data = trainData, importance = TRUE
)

# Predict on the test data and evaluate
predictions_class <- predict(rf_model, newdata = testData)
confusion_matrix <- confusionMatrix(predictions_class, testData$loan_status)
print(confusion_matrix)

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 5060  393
         1   34 1028

               Accuracy : 0.9345
                 95% CI : (0.9282, 0.9403)
    No Information Rate : 0.7819
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.7886

 Mcnemar's Test P-Value : < 2.2e-16

            Sensitivity : 0.9933
            Specificity : 0.7234
         Pos Pred Value : 0.9279
         Neg Pred Value : 0.9680
             Prevalence : 0.7819
         Detection Rate : 0.7767
   Detection Prevalence : 0.8370
      Balanced Accuracy : 0.8584

       'Positive' Class : 0
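The headline statistics can be reproduced from the four cell counts of the confusion matrix, which helps when reading the output. The sketch below uses only base R. The variable names are chosen for illustration, and the counts are copied from the matrix above, where class 0 (no default) is the positive class.

```r
# Counts from the confusion matrix (rows = prediction, columns = reference)
tp <- 5060  # predicted 0, actual 0 (class 0 is the "positive" class)
fn <- 34    # predicted 1, actual 0
fp <- 393   # predicted 0, actual 1
tn <- 1028  # predicted 1, actual 1
n <- tp + fn + fp + tn

accuracy <- (tp + tn) / n
sensitivity <- tp / (tp + fn)  # share of non-defaults predicted correctly
specificity <- tn / (tn + fp)  # share of defaults predicted correctly
balanced_accuracy <- (sensitivity + specificity) / 2

# Cohen's kappa: agreement beyond what the row and column totals give by chance
expected <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity,
              balanced_accuracy = balanced_accuracy,
              kappa = cohen_kappa), 4))
```

Because the positive class is 0, the specificity of 0.7234 is the figure that describes defaults: 393 of the 1,421 actual defaults in the test set were predicted as non-default.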

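# Plot Feature Importance for Classification Model
# The importance matrix has one column per class plus MeanDecreaseAccuracy and
# MeanDecreaseGini; the overall MeanDecreaseAccuracy column is plotted here
# (column 1 would show the importance for class 0 only).
importance_class <- importance(rf_model)
importance_df <- data.frame(Feature = rownames(importance_class),
                            Importance = importance_class[, "MeanDecreaseAccuracy"])
ggplot(importance_df, aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Feature Importance in Loan Default Prediction", x = "Features", y = "Importance")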

# Regression Task: Predicting Loan Interest Rate
# Keep only rows with a known loan_int_rate. Missing rates were already filled with
# the median during preprocessing, so this filter is a safeguard and removes no rows.
trainData_reg <- trainData[!is.na(trainData$loan_int_rate), ]
testData_reg <- testData[!is.na(testData$loan_int_rate), ]

# Train a Random Forest Regressor
rf_regressor <- randomForest(
  loan_int_rate ~ person_age + person_income + person_home_ownership +
    person_emp_length + loan_intent + loan_grade + loan_amnt +
    loan_percent_income + cb_person_cred_hist_length,
  data = trainData_reg, importance = TRUE
)

# Predict on the test data
predictions_reg <- predict(rf_regressor, newdata = testData_reg)

# Calculate and display Mean Squared Error (MSE)
mse <- mean((predictions_reg - testData_reg$loan_int_rate)^2)
cat("Mean Squared Error for Loan Interest Rate Prediction:", mse, "\n")

Mean Squared Error for Loan Interest Rate Prediction: 1.696639
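MSE is expressed in squared units of the target, so it is often reported alongside the root mean squared error (RMSE), which is in the same units as loan_int_rate. The sketch below shows how MSE, RMSE, mean absolute error, and R-squared could be computed together. The function `regression_metrics` and the example vectors are hypothetical and are not part of the project code.

```r
# Hypothetical helper: common error metrics for a regression model.
regression_metrics <- function(actual, predicted) {
  err <- predicted - actual
  mse <- mean(err^2)
  c(mse = mse,
    rmse = sqrt(mse),
    mae = mean(abs(err)),
    r_squared = 1 - sum(err^2) / sum((actual - mean(actual))^2))
}

actual <- c(10, 12, 14)     # invented interest rates
predicted <- c(11, 12, 13)  # invented predictions
metrics <- regression_metrics(actual, predicted)
print(metrics)

# RMSE implied by the MSE reported above
reported_rmse <- sqrt(1.696639)
print(reported_rmse)
```

For the reported MSE of 1.696639, the RMSE is about 1.30, meaning that a typical prediction error is about 1.30 in the units of loan_int_rate.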

# Plot Feature Importance for Regression Model
# Column 1 of the importance matrix is %IncMSE; it is selected by name for clarity
importance_reg <- importance(rf_regressor)
importance_df_reg <- data.frame(Feature = rownames(importance_reg),
                                Importance = importance_reg[, "%IncMSE"])
ggplot(importance_df_reg, aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "darkorange") +
  coord_flip() +
  labs(title = "Feature Importance in Loan Interest Rate Prediction", x = "Features",
       y = "Importance")
