PBA Assignment
Dataset overview:
The dataset consists of various financial variables, such as loan amounts, interest rates,
debt-to-income (DTI) ratios, annual incomes, loan verification status, loan terms, and loan purposes.
The project's primary goal is to identify and understand the relationships between these
variables and how they impact loan approval, loan amounts, and risk assessment.
Problem statement:
The dataset contains loan records covering borrower details, loan characteristics, and payment
information. The primary objective is to analyze the factors that influence loan status and
default risk, and to assess the relationship between borrower characteristics and loan terms.
Such analysis can help in improving loan approval processes, setting interest rates, and
devising strategies for risk management.
Methodology:
Through the application of histograms, box plots, bar plots, scatter plots, pair plots, and
heatmaps, we will examine the distribution, correlation, and relationship between different
variables in the dataset. This visual approach will facilitate communication and understanding
of complex financial relationships, enabling better-informed decision-making for lenders and
financial institutions.
In the following sections, we will present the analysis, interpretation, and implications of the
data visualizations, providing insights into loan approval, risk assessment, and lending
strategies.
DATA COLLECTION
Data Collection for Comprehensive Loan Information Dataset
Overview
The Comprehensive Loan Information for Credit Risk dataset provides detailed information
about loan applicants and their associated loan characteristics. This dataset is sourced from
Kaggle and is specifically designed for credit risk assessment.
Dataset Details
Source: Kaggle Comprehensive Loan Information Dataset
Provider: LendingClub
Variables (Features)
Id: Unique identifier for each loan application.
Loan Status: Current status of the loan (e.g., “Current,” “Late,” “Fully Paid”).
Sub-Grade: Loan sub-grade assigned by the lender.
Monthly Loan Payment Amount: Monthly payment amount for the loan.
Interest Rate of the Loan: Annual interest rate charged on the loan.
Significance
• The Comprehensive Loan Information dataset provides insights into borrowers’
financial profiles, loan characteristics, and repayment behavior.
• Researchers and financial institutions can use this data to develop credit risk models,
assess lending strategies, and improve decision-making.
This report outlines the data collection process for the Comprehensive Loan Information for
Credit Risk dataset sourced from Kaggle. The dataset’s features offer valuable information for
understanding loan applicants and their loan-related attributes.
Analysis:
The graph visualizes the relationship between loan status and different home ownership
categories.
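The counts behind a chart of this kind can be tabulated directly. The sketch below is illustrative only: the column names `home_ownership` and `loan_status` and the sample rows are assumptions, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical sample rows; column names are assumed, not taken from the dataset.
loans = pd.DataFrame({
    "home_ownership": ["RENT", "MORTGAGE", "RENT", "OWN", "MORTGAGE", "RENT"],
    "loan_status": ["Current", "Current", "Fully Paid", "Current", "Charged Off", "Current"],
})

# Rows: ownership category, columns: loan status, cells: number of loans.
counts = pd.crosstab(loans["home_ownership"], loans["loan_status"])

# Row-wise shares make ownership categories of different sizes comparable.
shares = pd.crosstab(loans["home_ownership"], loans["loan_status"], normalize="index")

print(counts)
print(shares.round(2))
```

Reading the shares alongside the raw counts helps separate "this category has many loans" from "this category has a high proportion of current loans".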
Key observations:
Home Ownership Categories:
Most loans are associated with individuals who either rent or have a mortgage. These two
categories have significantly higher counts than “OWN” or “OTHER.”
Almost no one in the “NONE” category has taken out a loan according to this dataset.
Loan Status:
For all types of home ownership except “NONE” (which has no data), most of the loans are
current.
Charged Off loans are relatively low in number across all ownership categories.
Fully Paid loans are more common than Charged Off loans.
Interpretation:
Borrowers who rent or have a mortgage account for most of the loans in this dataset.
The “OWN” category has fewer total loans but follows a similar trend, with most loans being
current.
The lack of data for “NONE” indicates that very few individuals fall into this category.
Implications:
Risk Assessment: Understanding loan status by home ownership helps assess risk. For
example, if most “OWN” loans are current, it suggests that homeowners are generally reliable
borrowers.
Business Decisions: Lenders can tailor their strategies based on ownership types. For instance,
offering better terms to mortgage holders might attract more business.
Market Insights: The data provides insights into housing trends and loan behavior across
different ownership categories.
Analysis:
The boxplot displays the following information for each loan grade:
Median (Q2): The horizontal line inside each box represents the median loan amount for that
grade.
Interquartile Range (IQR): The box represents the middle 50% of loan amounts (from Q1 to
Q3).
Whiskers: The vertical lines extending from the box reach the most extreme values that lie
within 1.5 times the IQR below Q1 and above Q3.
Outliers: Individual data points beyond the whiskers are considered outliers.
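The quantities listed above can be computed by hand for a single grade. This is a minimal sketch with made-up loan amounts; it follows the common 1.5 × IQR convention described above, which we assume is the one used by the plotting library.

```python
import pandas as pd

# Hypothetical loan amounts for a single grade (illustrative values only).
amounts = pd.Series([4000, 6000, 8000, 10000, 12000, 14000, 16000, 40000])

# Quartiles and interquartile range (the box).
q1, median, q3 = amounts.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Fences sit 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences.
inside = amounts[(amounts >= lower_fence) & (amounts <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the fences is drawn as an individual outlier point.
outliers = amounts[(amounts < lower_fence) | (amounts > upper_fence)]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers.tolist())
```

In this toy sample the single 40,000 loan falls above the upper fence, so it would appear as an outlier point while the upper whisker stops at 16,000.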
Key observations:
There are many outliers in all grades, especially in grade D, indicating that there are several
loans that are significantly higher than the rest.
Interpretation:
The variation in median values across different grades suggests that individuals with different
grades receive loans of varying amounts.
The presence of many outliers could indicate inconsistencies or exceptions in how loans are
awarded. It might be worth investigating these outliers further.
The higher median loan amount in grade G could imply that borrowers with lower credit grades
seek larger loans.
Implications:
Lending Strategy: Lenders should investigate their grading system and its correlation with
loan amounts. Adjusting grading criteria might lead to more consistent loan amounts.
Risk Assessment: Understanding the distribution of loan amounts by grade helps assess risk.
Higher loan amounts in certain grades might indicate higher default risk.
Customer Segmentation: Consider segmenting borrowers based on their credit grades and
loan preferences to tailor services accordingly.
Analysis:
The bar plot displays the following information:
The height of each bar corresponds to the average loan amount for that verification status.
Error bars are present on each bar, indicating variability in each category.
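The bar heights and error bars can be reproduced numerically. The sketch below assumes hypothetical column names and values, and uses the standard error of the mean as the measure of variability. The report does not state which measure the chart used; if it was drawn with seaborn's `barplot`, the default error bar is a bootstrapped 95% confidence interval instead.

```python
import pandas as pd

# Hypothetical rows; column names and values are assumptions for illustration.
loans = pd.DataFrame({
    "verification_status": ["Verified", "Verified", "Verified",
                            "Not Verified", "Not Verified", "Not Verified"],
    "loan_amount": [18000, 15000, 21000, 8000, 10000, 6000],
})

# Bar height: mean loan amount per status.
# Error bar (assumed here): standard error of the mean, i.e. sample std / sqrt(n).
summary = (
    loans.groupby("verification_status")["loan_amount"]
    .agg(["mean", "sem", "count"])
)

print(summary)
```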
Interpretation:
Borrowers whose income or other information has been verified tend to receive higher loan
amounts (as indicated by the “Verified” category).
Loans with “Not Verified” status have the lowest average loan amounts.
Implications:
Risk Assessment: Lenders might consider the verification status as an indicator of borrower
reliability. Verified borrowers may be perceived as less risky.
Lending Decisions: Offering larger loans to verified borrowers could attract more business.
Fairness and Consistency: Lenders should ensure that verification processes are consistent and
fair across all applicants.
Analysis:
There’s a peak near the lower end of the income scale, indicating a concentration of low-income
earners.
The KDE line confirms the peak near zero and the rapid decline afterward.
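A peak at the low end followed by a long right tail can be checked numerically: in a right-skewed distribution the mean is pulled above the median and the skewness statistic is positive. The incomes below are made up for illustration.

```python
import pandas as pd

# Hypothetical annual incomes with a long right tail (illustrative values only).
income = pd.Series([25000, 30000, 32000, 35000, 40000, 45000, 60000, 250000])

mean_income = income.mean()
median_income = income.median()
skewness = income.skew()  # a value above 0 indicates a right (positive) skew

print(mean_income, median_income, round(skewness, 2))
```

Here one very high income lifts the mean well above the median, which is the same pattern the histogram shows visually.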
Interpretation:
Income Disparities:
The high frequency of low-income earners suggests income disparities within this dataset.
Many individuals have annual incomes close to zero or very low.
Wealth Distribution:
The sharp decline in frequency as income increases indicates that fewer people earn higher
incomes.
The long tail on the right side represents a smaller number of high-income earners.
Implications:
Social Equity: Addressing income disparities is crucial for promoting social equity.
Targeted Services: Financial institutions can tailor services based on income levels.
Risk Assessment: Understanding income distribution helps assess risk associated with loans or
credit decisions.
Analysis:
The box plot displays the following information:
Two distinct boxes represent two different loan terms: “36 months” and “60 months.”
The orange box represents the “36 months” term, and the blue box represents the “60 months”
term.
The horizontal line represents the median loan amount for that term.
The box represents the interquartile range (IQR), covering the middle 50% of loan amounts.
The whiskers extend from the box to the minimum and maximum values within 1.5 times the
IQR.
Interpretation:
The median loan amount is lower for the “36 months” term than for the “60 months” term.
Implications:
Risk and Duration: Longer-term loans (60 months) tend to have larger loan amounts, but they
also come with higher risk due to the wider spread of data.
Borrower Preferences: Borrowers might choose longer terms for larger loans, but lenders
should carefully assess risk associated with extended repayment periods.
Lending Strategy: Lenders can tailor their offerings based on borrower preferences and risk
tolerance for different loan terms.
Analysis:
The bar plot displays the following information:
The height of each bar corresponds to the average loan amount for that purpose.
Interpretation:
“Major Purchase,” “Small Business,” and “Home Improvement” have the highest average loan
amounts.
“Credit Card” and “Debt Consolidation” have similar average loan amounts.
Implications:
Lending Strategies: Lenders can tailor their loan products based on specific purposes.
Risk Assessment: Different loan purposes may carry varying levels of risk. Lenders should
consider this when evaluating loan applications.
Customer Segmentation: Understanding loan purposes helps segment borrowers and design
targeted marketing campaigns.
Analysis:
The histogram displays the distribution of debt-to-income (DTI) ratios.
The x-axis represents ranges (bins) of DTI ratios, while the y-axis represents the frequency
(number of loans) falling in each range.
The KDE (Kernel Density Estimation) line provides a smoothed estimate of the underlying
probability density function.
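To make the idea of the KDE line concrete, the sketch below builds a Gaussian kernel density estimate by hand: each observation contributes a small bell curve, and the curves are averaged. The DTI values and the bandwidth of 0.03 are assumptions chosen for illustration; plotting libraries typically pick the bandwidth automatically (for example with Scott's rule).

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardised distance from every grid point to every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Hypothetical DTI ratios (illustrative values only).
dti = [0.08, 0.11, 0.12, 0.14, 0.15, 0.15, 0.17, 0.19, 0.22, 0.30]
grid = np.linspace(-0.2, 0.6, 801)
density = gaussian_kde_1d(dti, grid, bandwidth=0.03)

# A density integrates to about 1; its peak shows where DTI values concentrate.
area = density.sum() * (grid[1] - grid[0])
peak = grid[density.argmax()]

print(round(area, 3), round(peak, 3))
```

A smaller bandwidth would follow individual observations more closely, while a larger one would smooth the curve further, so the apparent shape depends partly on this choice.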
Interpretation:
Most individuals in the dataset have DTI ratios clustered around 0.10 to 0.20.
There are very few individuals with extremely low or high DTI ratios.
The KDE line shows that the data is skewed toward lower DTI ratios.
Implications:
Risk Assessment:
A higher DTI ratio indicates that an individual has more debt relative to their income.
Lenders should be cautious when approving loans for individuals with high DTI ratios.
Financial Health:
Understanding the distribution of DTI ratios helps financial institutions assess the financial
health of their customers.
Analysis:
The scatter plot displays individual data points, each representing a borrower.
The x-axis represents “Annual Income,” while the y-axis represents “Loan Amount.”
Data points are color-coded based on their “loan_status”: Charged Off (green), Fully Paid
(orange), and Current (blue).
Most data points cluster toward the lower end of annual income, indicating that many loans are
taken by individuals with lower incomes.
There’s a concentration of fully paid loans (orange) at lower annual incomes and smaller loan
amounts.
Interpretation:
Individuals with lower annual incomes tend to take smaller loan amounts.
Risk Assessment:
The presence of charged-off loans (green) among those with low annual income suggests
higher risk for default.
Lenders should be cautious when approving loans for individuals with low income.
Understanding this distribution helps financial institutions assess the financial health of their
customers.
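Because most borrowers sit at the lower end of the income scale, a scatter plot will show many charged-off points there even if the underlying rate is no higher. One way to check the risk claim is to compare the share of charged-off loans across income bands rather than their raw counts. The sketch below uses hypothetical column names, rows, and band edges.

```python
import pandas as pd

# Hypothetical rows; column names and values are assumptions for illustration.
loans = pd.DataFrame({
    "annual_income": [20000, 25000, 30000, 35000, 60000, 65000, 70000, 120000],
    "loan_status": ["Charged Off", "Current", "Charged Off", "Fully Paid",
                    "Current", "Charged Off", "Current", "Fully Paid"],
})

# Bucket incomes into hypothetical bands (edges chosen for illustration).
bands = pd.cut(
    loans["annual_income"],
    bins=[0, 40000, 80000, float("inf")],
    labels=["low", "middle", "high"],
)

# Share of loans in each band that were charged off.
charged_off = loans["loan_status"].eq("Charged Off")
rate_by_band = charged_off.groupby(bands, observed=False).mean()

print(rate_by_band)
```

If the charged-off rate is similar across bands in the real data, the concentration of charged-off points at low incomes would mainly reflect where borrowers are concentrated.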
Implications:
Risk Mitigation:
Lenders might consider implementing more stringent lending criteria for applicants with lower
annual incomes.
Tailored Services:
Offering financial literacy programs or personalized advice to borrowers with low income can
improve financial well-being.
Targeted loan products for specific income segments can enhance customer satisfaction.
Analysis:
The pair plot displays scatter plots for different combinations of two variables
(‘annual_income’ and ‘loan_amount’) and histograms or kernel density estimates for individual
variables.
The hue parameter colors the data points based on the ‘grade’ column.
Interpretation:
There’s no clear linear relationship between annual income and loan amount.
Most data points cluster toward the lower end of annual income and loan amount.
As annual income increases, loan amounts spread out more widely and the points become sparser.
Certain grades seem associated with specific ranges of annual incomes and loan amounts.
For example, Grade A loans tend to have higher annual incomes and smaller loan amounts.
Implications:
Financial analysts or credit officers could use such visualizations to identify patterns or trends
that might inform credit risk assessments or lending decisions.
Understanding the relationships between these variables can guide lending strategies and risk
management.
CORRELATION MATRIX
Analysis:
The heatmap displays the pairwise correlation coefficients between five financial variables:
‘annual_income’, ‘loan_amount’, ‘dti’ (debt-to-income ratio), ‘total_acc’ (total accounts), and
‘int_rate’ (interest rate).
The color scale ranges from cool (blue) to warm (red), representing negative to positive
correlations.
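The numbers shown in a heatmap like this come from a pairwise correlation matrix. The sketch below computes Pearson correlations with pandas; the column names follow those listed above, but the rows are made up, so the resulting coefficients will not match the ones reported in this section.

```python
import pandas as pd

# Hypothetical rows (illustrative values only); column names follow the text above.
loans = pd.DataFrame({
    "annual_income": [30000, 45000, 60000, 80000, 120000],
    "loan_amount": [5000, 12000, 9000, 20000, 18000],
    "dti": [0.25, 0.18, 0.20, 0.12, 0.10],
    "total_acc": [8, 15, 12, 22, 30],
    "int_rate": [0.14, 0.11, 0.12, 0.09, 0.08],
})

# Pearson correlation for every pair of columns.
# The result is symmetric, with 1.0 on the diagonal and values between -1 and 1.
corr = loans.corr(method="pearson")

print(corr.round(2))
```

Pearson correlation captures only linear association, so a coefficient near zero does not rule out a non-linear relationship between two variables.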
Interpretation:
Annual income and loan amount: A modest positive correlation (0.27) indicates that as annual
income increases, the loan amount tends to increase as well.
Annual income and DTI: A weak negative correlation (-0.12) suggests that higher income may be
associated with a lower debt-to-income ratio.
Loan amount and interest rate: A positive correlation (0.31) indicates that larger loans are
associated with higher interest rates.
Total accounts and DTI: The positive correlation indicates that individuals with more accounts
tend to have higher debt-to-income ratios.
Implications:
Risk Assessment:
The positive correlation between annual income and loan amount suggests that applicants with
higher incomes might be eligible for larger loans.
The positive correlation between loan amount and interest rate suggests that risk assessment
varies with loan size; larger loans might be considered riskier.
Portfolio Diversification:
Understanding these correlations can guide investment decisions, especially when considering
diversification across different loan types.
Analysis:
The histogram shows the frequency of different interest rate values in the dataset.
The KDE (Kernel Density Estimation) line provides a smoothed estimate of the underlying
probability density function.
Interest rates around 0.075 and 0.125 are most common, as indicated by their higher
frequencies.
Interpretation:
The distribution is right-skewed, meaning there are some loans with relatively high interest
rates, but they are less frequent.
Most data points cluster toward the lower end of interest rates.
There are two prominent peaks in this distribution, indicating that there are two groups or
clusters within this data with distinct average interest rates.
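A simple way to check for two peaks is to bin the rates and look for bins that are taller than both neighbours. The sketch below uses made-up interest rates and a hand-picked bin layout, both of which are assumptions. The number of peaks found depends on the bin width, and the strict comparison used here ignores flat-topped peaks.

```python
import numpy as np

# Hypothetical interest rates forming two clusters (illustrative values only).
rates = np.array([
    0.065, 0.070, 0.072, 0.075, 0.075, 0.078,  # cluster near 0.075
    0.085, 0.095, 0.105,
    0.122, 0.125, 0.125, 0.128, 0.131,         # cluster near 0.125
    0.145, 0.155, 0.170,
])

# Six equal-width bins between 0.06 and 0.18 (bin layout chosen for illustration).
counts, edges = np.histogram(rates, bins=6, range=(0.06, 0.18))
centers = (edges[:-1] + edges[1:]) / 2

# Pad with zeros so the first and last bins can also qualify as peaks,
# then keep bins strictly taller than both neighbours.
padded = np.concatenate(([0], counts, [0]))
is_peak = (padded[1:-1] > padded[:-2]) & (padded[1:-1] > padded[2:])
peak_centers = centers[is_peak]

print(counts.tolist(), peak_centers.round(3).tolist())
```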
Implications:
Risk and Return Profiles:
Lenders can use this information to assess risk and return profiles associated with different
interest rates.
Borrowers can gauge typical interest rates they might expect when seeking loans.
Product Offerings:
Financial institutions can design loan products tailored to specific interest rate ranges based on
customer preferences and risk tolerance.
IMPLEMENTATION
CONCLUSION