PBA
The Bank Loan Default Risk Analysis project focuses on utilizing a dataset obtained from
Kaggle to identify patterns that indicate potential difficulties faced by clients in paying their loan
installments. The dataset contains a wealth of information related to bank loans, encompassing
borrower profiles, loan details, and loan statuses. The primary objective of this project is to
analyze the dataset comprehensively to uncover patterns or characteristics that may signal a
borrower's likelihood of facing challenges in meeting their loan obligations.
Dataset Overview
The dataset includes a variety of features such as unique identifiers for loans, borrower
characteristics like employment details and income, loan specifics including amounts and terms,
as well as crucial dates related to loan transactions. These features serve as key indicators that
can be leveraged to assess the risk of loan default and make informed decisions regarding loan
approvals and risk management strategies.
Problem Statement
The core problem addressed in this project revolves around identifying patterns within the
dataset that can help predict if a client is likely to encounter difficulties in repaying their loan
installments. By recognizing these patterns, lenders can take proactive measures such as
adjusting loan amounts, offering loans at higher interest rates to risky applicants, or even denying
loans to high-risk individuals. The ultimate goal is to enhance the decision-making process for
lenders, enabling them to manage their loan portfolios effectively and minimize losses resulting
from loan defaults.
Methodology
The chosen methodology for this project encompasses three key components: Univariate
Analysis, Borrower Characteristics Analysis, and Categorical Features Analysis. Each
component plays a vital role in dissecting the dataset to understand loan default risk factors
comprehensively. Through Univariate Analysis, the distribution of loan statuses is examined to
establish baseline default rates. Borrower Characteristics Analysis delves into attributes like
income and employment status to uncover correlations with loan default rates. Categorical
Features Analysis focuses on exploring default rates across different categories like home
ownership and loan purpose to tailor lending strategies accordingly.
Rationale for Methodology
The selected methodology offers a holistic approach to dissecting the dataset, providing a well-
rounded view of borrower behavior and loan characteristics. By combining these analyses, the
project aims to generate valuable insights that can guide lenders in making informed decisions
related to loan approvals, risk assessment, and mitigation strategies. Visualizations are employed
throughout the analysis to enhance clarity and facilitate the identification of trends and
relationships within the data.
DATA COLLECTION
Source: The dataset is obtained from Kaggle
(https://www.kaggle.com/datasets/amity024/data-sources/data).
Description: The dataset contains information about bank loans, including various features
related to the borrower's profile, loan details, and loan status. It is intended to support the
identification of patterns that indicate whether a client is likely to face difficulties in paying
their loan installments.
Features (Variables):
The dataset includes the following features:
PROBLEM STATEMENT
The problem statement for this assignment is: "To identify patterns which indicate if a client has
difficulty paying their installments, which may be used for taking actions such as denying the
loan, reducing the amount of loan, lending (to risky applicants) at a higher interest rate, etc."
The goal is to analyze the dataset and identify patterns or characteristics that may indicate a
borrower's potential difficulty in making loan payments. This information can be used by lenders
to make informed decisions, such as denying loans to high-risk applicants, reducing loan
amounts for certain borrower profiles, or offering loans at higher interest rates to mitigate the
risk of default.
By analyzing the provided features and loan status information, the analysis aims to uncover
insights that can help lenders manage their loan portfolios more effectively and minimize
potential losses due to loan defaults.
CHOSEN METHODOLOGY
In the context of the Bank Loan Default Risk Analysis assignment, the chosen methodology
involves three key components: Univariate Analysis, Borrower Characteristics Analysis, and
Categorical Features Analysis. Each component plays a crucial role in understanding patterns
that indicate clients' difficulty in paying their loan installments.
1. Univariate Analysis:
Objective: The primary goal of the Univariate Analysis is to analyze the distribution of loans
across different loan status categories (defaulted, fully paid, charged off) to establish baseline
default rates.
Rationale: By examining the distribution of loan statuses, we can identify the proportion of
loans that have defaulted, been fully paid, or charged off. This analysis provides a foundational
understanding of default rates within the dataset.
Importance: Understanding the baseline default rates is essential for assessing the overall risk
associated with the loans and identifying potential areas of concern.
3. Categorical Features Analysis:
Objective: This analysis examines the distribution of categorical features like
address_state, application_type, home_ownership, and purpose to identify differences in default
rates across categories.
Rationale: By analyzing categorical features, we can assess whether certain borrower profiles
or loan purposes are associated with higher default rates. Understanding these differences can
guide decision-making processes related to loan approvals and risk management.
Importance: Identifying variations in default rates across categorical features enables lenders
to tailor their lending practices and strategies based on specific borrower characteristics and loan
purposes.
The chosen methodology was selected for its comprehensive approach to analyzing the aspects of
the dataset related to borrower behavior and loan characteristics. By employing this
methodology, we aim to uncover meaningful patterns and relationships within the dataset that
can assist in making informed decisions regarding loan approvals, risk assessment, and
mitigation strategies.
IMPLEMENTATION
Importing Libraries: The code begins by importing the necessary libraries: pandas for data
manipulation, numpy for numerical operations, matplotlib for data visualization, and seaborn for
creating attractive and informative statistical graphics.
Loading the Datasets: The code loads two datasets, application_data.csv and
previous_application.csv, using the pd.read_csv() function from the pandas library. The column
names of both datasets are printed to ensure they are loaded correctly.
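The loading step can be sketched as follows. This is a minimal illustration, not the project's actual script: the two in-memory CSV snippets are toy stand-ins for application_data.csv and previous_application.csv, and the rows in them are invented so the example runs without the Kaggle files.

```python
import io

import pandas as pd

# Toy stand-ins (assumption) for application_data.csv and previous_application.csv.
# With the real files on disk, pd.read_csv("application_data.csv") would be used instead.
application_csv = io.StringIO(
    "SK_ID_CURR,AMT_INCOME_TOTAL,AMT_CREDIT,NAME_HOUSING_TYPE\n"
    "100001,202500,406597,House / apartment\n"
    "100002,270000,1293502,Rented apartment\n"
)
previous_csv = io.StringIO(
    "SK_ID_PREV,SK_ID_CURR,AMT_CREDIT,NAME_CONTRACT_STATUS\n"
    "2000001,100001,179055,Approved\n"
    "2000002,100001,0,Canceled\n"
)

application_data = pd.read_csv(application_csv)
previous_application = pd.read_csv(previous_csv)

# Printing the column labels is a quick check that both files parsed as expected.
print(application_data.columns.tolist())
print(previous_application.columns.tolist())
```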
Handling Missing Values: The code replaces the value -1 with np.nan (Not a Number) in both
datasets using the replace() method from pandas. This step is crucial for handling missing values
appropriately during data analysis.
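A small sketch of this replacement step, using an invented three-row frame in which -1 is assumed to mark a missing CNT_PAYMENT value:

```python
import numpy as np
import pandas as pd

# Toy frame (assumption): -1 is used as a placeholder for a missing value.
toy = pd.DataFrame(
    {
        "SK_ID_CURR": [100001, 100002, 100003],
        "CNT_PAYMENT": [12, -1, 24],
    }
)

# replace() scans every column, so any -1 anywhere in the frame becomes NaN.
cleaned = toy.replace(-1, np.nan)

# Count the missing values per column after the replacement.
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```

Because replace() is applied to the whole frame, a column in which -1 is a genuine value would be blanked as well. If that matters, the replacement can be limited to chosen columns, for example with a nested dictionary such as toy.replace({"CNT_PAYMENT": {-1: np.nan}}).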
Merging the Datasets: The code merges the application_data and previous_application datasets
using the pd.merge() function from pandas. The datasets are merged based on the common
column SK_ID_CURR, and the how='left' parameter ensures that all rows from the left
DataFrame (application_data) are included in the merged DataFrame, even if there are no
matching rows in the right DataFrame (previous_application). The merged dataset is stored in the
variable merged_data.
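The left merge can be illustrated on two toy frames (the rows are invented for the example). The sketch also shows why the merged data contains AMT_CREDIT_x and AMT_CREDIT_y: both inputs have an AMT_CREDIT column, so pandas appends its default suffixes to tell them apart.

```python
import pandas as pd

# Toy inputs (assumption): three current applications, three previous ones.
application_data = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 2, 3],
        "AMT_CREDIT": [400000.0, 1200000.0, 135000.0],
    }
)
previous_application = pd.DataFrame(
    {
        "SK_ID_PREV": [10, 11, 12],
        "SK_ID_CURR": [1, 1, 2],
        "AMT_CREDIT": [50000.0, 0.0, 90000.0],
        "NAME_CONTRACT_STATUS": ["Approved", "Canceled", "Refused"],
    }
)

# how="left" keeps every row of application_data; client 3 has no previous
# application, so the columns coming from the right frame are NaN for that row.
merged_data = pd.merge(
    application_data, previous_application, on="SK_ID_CURR", how="left"
)

print(merged_data.columns.tolist())
print(len(merged_data))
```

Note that a client with several previous applications appears once per previous application in merged_data, so counts taken on the merged frame are counts of previous applications rather than of clients.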
Univariate Analysis: Loan Status: The code performs a univariate analysis on the
NAME_CONTRACT_STATUS column, which represents the loan status. It creates a bar plot
using matplotlib and seaborn to visualize the distribution of loan statuses. Additionally, it prints
the loan status counts.
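A minimal sketch of the status count and bar plot, on invented data. It uses pandas and matplotlib only; seaborn's countplot would be the closer match to the description above, and is swapped out here purely to keep the example dependency-light.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Toy data (assumption): ten previous applications with their contract status.
merged_data = pd.DataFrame(
    {
        "NAME_CONTRACT_STATUS": ["Approved"] * 5
        + ["Canceled"] * 2
        + ["Refused"] * 2
        + ["Unused offer"]
    }
)

# Absolute counts and relative shares of each status.
status_counts = merged_data["NAME_CONTRACT_STATUS"].value_counts()
status_share = merged_data["NAME_CONTRACT_STATUS"].value_counts(normalize=True)
print(status_counts)

# One bar per status, labeled as in the report.
fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(status_counts.index, status_counts.values)
ax.set_title("Contract Status Distribution")
ax.set_xlabel("Contract status")
ax.set_ylabel("Count")
fig.tight_layout()
```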
Borrower Characteristics Analysis: The code analyzes the distribution of borrower
characteristics, such as income (AMT_INCOME_TOTAL) and loan amount (AMT_CREDIT_x
and AMT_CREDIT_y), across different loan statuses. It creates boxplots using seaborn to
visualize these distributions.
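The comparison of a numeric feature across loan statuses can be sketched as below. The data is invented, and matplotlib's boxplot is used in place of seaborn's so the example needs fewer dependencies; the grouping logic is the same either way.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Toy data (assumption): income of applicants for two contract statuses.
merged_data = pd.DataFrame(
    {
        "NAME_CONTRACT_STATUS": ["Approved", "Approved", "Approved", "Refused", "Refused"],
        "AMT_INCOME_TOTAL": [100000.0, 200000.0, 300000.0, 50000.0, 150000.0],
    }
)

# Numeric summary that the boxplot visualizes: the median income per status.
median_income = merged_data.groupby("NAME_CONTRACT_STATUS")["AMT_INCOME_TOTAL"].median()
print(median_income)

# Collect one array of incomes per status; NaN values are dropped before plotting.
labels = []
samples = []
for status, group in merged_data.groupby("NAME_CONTRACT_STATUS"):
    labels.append(status)
    samples.append(group["AMT_INCOME_TOTAL"].dropna().to_numpy())

fig, ax = plt.subplots(figsize=(6, 4))
ax.boxplot(samples)  # boxes are drawn at positions 1..len(samples)
ax.set_xticks(range(1, len(labels) + 1))
ax.set_xticklabels(labels)
ax.set_xlabel("Contract status")
ax.set_ylabel("AMT_INCOME_TOTAL")
fig.tight_layout()
```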
Categorical Features Analysis: The code examines the default rates by categorical features like
home ownership (NAME_HOUSING_TYPE) and loan purpose
(NAME_CASH_LOAN_PURPOSE). It calculates the default rates for each category using
groupby and value_counts from pandas, and then visualizes the results using stacked bar plots
with matplotlib and seaborn.
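The groupby and value_counts combination described above can be sketched like this. The rows are invented, and the quantity computed is the share of each contract status within a category, which is what a stacked bar per category displays.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Toy data (assumption): loan purpose and contract status of six applications.
merged_data = pd.DataFrame(
    {
        "NAME_CASH_LOAN_PURPOSE": ["Repairs", "Repairs", "Repairs", "Repairs", "XAP", "XAP"],
        "NAME_CONTRACT_STATUS": [
            "Approved", "Refused", "Refused", "Canceled", "Approved", "Unused offer",
        ],
    }
)

# Share of each status within each purpose; unstack() turns the statuses
# into columns, and missing combinations are filled with 0.
status_share = (
    merged_data.groupby("NAME_CASH_LOAN_PURPOSE")["NAME_CONTRACT_STATUS"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)
print(status_share)

# Each bar is one purpose; its segments are the status shares and sum to 1.
ax = status_share.plot(kind="bar", stacked=True, figsize=(7, 4))
ax.set_ylabel("Share of applications")
ax.figure.tight_layout()
```

The same pattern applies to NAME_HOUSING_TYPE by changing the grouping column.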
Operators and Expressions: The code introduces a new column CREDIT_TO_INCOME_RATIO
in the application_data dataset by dividing AMT_CREDIT by AMT_INCOME_TOTAL. This
demonstrates the use of operators and expressions in data manipulation.
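As a sketch, the ratio is a single element-wise division of two columns. The three rows below are invented for illustration.

```python
import pandas as pd

# Toy data (assumption): credit amount and annual income of three clients.
application_data = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 2, 3],
        "AMT_CREDIT": [400000.0, 90000.0, 150000.0],
        "AMT_INCOME_TOTAL": [200000.0, 300000.0, 100000.0],
    }
)

# The division is applied row by row without an explicit loop.
application_data["CREDIT_TO_INCOME_RATIO"] = (
    application_data["AMT_CREDIT"] / application_data["AMT_INCOME_TOTAL"]
)
print(application_data)
```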
Control Flow and Loop: The code uses a for loop to iterate over the rows of the application_data
dataset and identify clients with a CREDIT_TO_INCOME_RATIO greater than 0.5. The
SK_ID_CURR values of these clients are stored in the high_credit_to_income list, showcasing
the use of control flow statements.
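A minimal sketch of such a loop on invented data. The report does not say which iteration method the code uses, so itertuples() here is an assumption; a vectorized filter that yields the same list is shown alongside it for comparison.

```python
import pandas as pd

# Toy data (assumption): four clients with precomputed ratios.
application_data = pd.DataFrame(
    {
        "SK_ID_CURR": [100001, 100002, 100003, 100004],
        "CREDIT_TO_INCOME_RATIO": [2.0, 0.3, 0.5, 1.5],
    }
)

# Loop version: visit each row and keep the IDs whose ratio exceeds 0.5.
high_credit_to_income = []
for row in application_data.itertuples(index=False):
    if row.CREDIT_TO_INCOME_RATIO > 0.5:
        high_credit_to_income.append(row.SK_ID_CURR)

# Vectorized version: a boolean mask selects the same rows in one step.
mask = application_data["CREDIT_TO_INCOME_RATIO"] > 0.5
vectorized_ids = application_data.loc[mask, "SK_ID_CURR"].tolist()

print(len(high_credit_to_income))
```

The comparison is strict, so a client whose ratio is exactly 0.5 is not included. On a frame with hundreds of thousands of rows the vectorized form is usually much faster than the loop.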
Data Structures: The code prints the unique values in the NAME_HOUSING_TYPE column of
the application_data dataset, demonstrating the use of data structures in Python (unique()
returns an array of the distinct values, which can be converted to a list).
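A short sketch of this step on an invented column. It also shows the conversion from the array returned by unique() to a plain Python list.

```python
import pandas as pd

# Toy data (assumption): housing type of four clients.
application_data = pd.DataFrame(
    {
        "NAME_HOUSING_TYPE": [
            "House / apartment",
            "Rented apartment",
            "House / apartment",
            "With parents",
        ]
    }
)

# unique() keeps the order of first appearance and drops repeats.
housing_types = application_data["NAME_HOUSING_TYPE"].unique()

# tolist() converts the result into an ordinary Python list.
housing_list = housing_types.tolist()
print(housing_list)
```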
Visualization: Histogram, Scatter Plot, Pair Plot, and Correlation Heatmap: The code creates
various visualizations using seaborn and matplotlib, including histograms for
CREDIT_TO_INCOME_RATIO, AMT_INCOME_TOTAL, AMT_CREDIT_x,
DAYS_DECISION, and CNT_PAYMENT; a scatter plot of AMT_INCOME_TOTAL vs.
AMT_CREDIT_x colored by NAME_CONTRACT_STATUS; a pair plot of numerical features;
and a correlation heatmap of numerical features. These visualizations aid in exploring and
understanding the relationships between different variables in the dataset.
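Two of these plots, the ratio histogram and the correlation heatmap, can be sketched as follows. The data is randomly generated for illustration only, and matplotlib's hist and imshow stand in for the seaborn calls so that the example has fewer dependencies.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data (assumption): 200 applications with random income and credit.
rng = np.random.default_rng(0)
income = rng.uniform(50_000, 300_000, size=200)
credit = income * rng.uniform(1.0, 5.0, size=200)
toy = pd.DataFrame(
    {
        "AMT_INCOME_TOTAL": income,
        "AMT_CREDIT_x": credit,
        "CREDIT_TO_INCOME_RATIO": credit / income,
        "NAME_CONTRACT_STATUS": rng.choice(["Approved", "Refused"], size=200),
    }
)

# Correlations are only defined for numeric columns, so text columns are excluded.
corr = toy.select_dtypes(include="number").corr()

fig, (ax_hist, ax_heat) = plt.subplots(1, 2, figsize=(11, 4))

# Histogram of the credit-to-income ratio.
ax_hist.hist(toy["CREDIT_TO_INCOME_RATIO"], bins=20)
ax_hist.set_title("Credit to Income Ratio Distribution")
ax_hist.set_xlabel("CREDIT_TO_INCOME_RATIO")
ax_hist.set_ylabel("Count")

# Heatmap of the correlation matrix, with the color scale fixed to [-1, 1].
image = ax_heat.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax_heat.set_xticks(range(len(corr.columns)))
ax_heat.set_xticklabels(corr.columns, rotation=45, ha="right")
ax_heat.set_yticks(range(len(corr.index)))
ax_heat.set_yticklabels(corr.index)
ax_heat.set_title("Correlation Heatmap")
fig.colorbar(image, ax=ax_heat)
fig.tight_layout()
```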
The .columns attribute returns the column labels of the DataFrame. In this case, the DataFrame
contains information about loan applications, including the following:
SK_ID_PREV: This is a unique identifier for a previous loan application made by the client.
SK_ID_CURR: This is a unique identifier for the current loan application.
NAME_CONTRACT_TYPE: This indicates the type of loan applied for, such as a mortgage or
car loan.
AMT_ANNUITY: This contains the amount the applicant would have to pay monthly if the loan
was approved.
AMT_APPLICATION: This contains the loan amount applied for by the applicant.
AMT_CREDIT: This contains the credit limit granted by the bank.
There are also several features related to the application and approval process, such as
WEEKDAY_APPR_PROCESS_START and DAYS_DECISION.
Univariate Analysis
Loan Status
The code performs a univariate analysis, focusing on a single variable: loan status.
It generates a bar chart using Matplotlib to visualize the distribution of loan statuses.
The chart title is "Contract Status Distribution".
The x-axis represents different contract statuses.
The y-axis represents the count of loans in each status category.
The bar chart shows the relative proportion of loans in each status category. This reveals
which statuses are most common and which are relatively rare.
The chart also identifies the status categories with the highest counts. These statuses
represent the most common outcomes of loans in the dataset, and their implications for
the overall health of the loan portfolio should be considered.
A high concentration of loans in statuses like "Approved" or "Completed" suggests a
low-risk portfolio, while a significant number of loans in "Defaulted" or "Canceled"
indicates a higher risk profile.
Loan Status Distribution
Approved Loans: A total of 886,099 loans were approved. This indicates a high number
of successful loan applications.
Canceled Loans: There were 259,441 loans that were canceled. These could be
applications withdrawn by applicants or canceled by the bank for various reasons.
Refused Loans: The count of refused loans stands at 245,398. These are the applications
that the bank has declined, possibly due to credit risk concerns or applicant ineligibility.
Unused Offers: A smaller number, 22,771, represents unused offers. These might be
approved loans that the applicants decided not to proceed with.
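The counts above can be turned into shares of the total with a few lines. The sketch below uses only the four figures reported in this section; nothing is read from the dataset.

```python
import pandas as pd

# Counts as reported in this section.
status_counts = pd.Series(
    {
        "Approved": 886_099,
        "Canceled": 259_441,
        "Refused": 245_398,
        "Unused Offers": 22_771,
    }
)

# Total number of records and the percentage share of each status.
total = status_counts.sum()
status_share_pct = (status_counts / total * 100).round(2)

print(total)
print(status_share_pct)
```

On these figures, approved applications make up a little under two thirds of the total, canceled and refused applications each account for somewhat less than a fifth, and unused offers for under 2%.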
Observations:
Approved Loans: Generally have high default rates, which could indicate that the
approval criteria might need to be re-evaluated.
Refused Loans: The default rates vary significantly, possibly reflecting the effectiveness
of the refusal criteria.
Canceled Loans: Have lower default rates, which might be due to applicants self-
selecting out of the process.
Unused Offers: Consistently have the lowest default rates, likely because these do not
result in actual loans.
Building a House or an Annex: Shows a high refusal rate, which could indicate a higher
risk associated with this loan purpose or stringent approval criteria.
Business Development: The full output for this loan purpose is not shown; it would
give the proportion of each contract status for this purpose.
XAP: This category has a significant proportion of unused offers, suggesting that many
applicants do not proceed with their applications.
XNA: Represents cases with unspecified loan purposes, with a very low proportion of
unused offers, indicating most applications in this category are processed.
Control Flow and Loop
Number of clients with credit-to-income ratio > 0.5
The output indicates that there are 305,648 clients with a credit-to-income ratio greater
than 0.5.
A high credit-to-income ratio means that the credit amount is large relative to the
client’s income, which could indicate a higher risk of default.
Visualization
Default Rates by Loan Purpose (Bar Graph)
Approved Segment (Green): Shows a higher proportion across all loan purposes,
indicating a larger number of approved contracts.
Canceled Segment (Blue): Varies among the categories, suggesting different cancellation
rates depending on the loan purpose.
Refused Segment (Orange): Appears to have a lower proportion compared to approved,
which could indicate a lower number of refusals or a higher success rate of applications.
Unused Offer Segment (Gray): Consistently the smallest segment, implying that few
offers go unused.
Credit to Income Ratio Distribution (Histogram)
Low Ratios: The concentration of data points in the lower end of the ratio spectrum
suggests that most individuals in the dataset have manageable levels of credit relative to
their income.
High Ratios: The presence of outliers with higher ratios could be a concern for lenders, as
it may indicate a higher risk of default due to over-leveraging.
Risk Assessment: Individuals with higher credit-to-income ratios may require closer
examination to assess their ability to service their debt.
Income vs. Loan Amount by Loan Status (Scatter Plot)
Data Clustering: Most data points are clustered towards the lower end of both income and
credit amounts, suggesting that the majority of loans and incomes in the dataset are
relatively small.
Outliers: There is an outlier with a significantly higher income but a moderate loan
amount, marked in blue (approved), which could indicate a conservative borrowing
behavior or a high-income individual not maximizing their borrowing potential.
Loan Status Distribution: The distribution of different loan statuses can provide insights
into the lending practices and financial behaviors associated with different income levels.
Approved Loans: The prevalence of approved loans at lower income and credit amounts
may suggest a conservative lending approach for this dataset.
High-Income Borrowers: The outlier suggests that there may be high-income individuals
who are either less inclined to borrow or are being approved for lower amounts relative to
their income, which could be an area to explore for potential credit growth.
Pair Plot of Numerical Features
Correlations: By examining the scatter plots, we can identify whether there are any linear
relationships between the features. For example, a positive correlation would be indicated
by a collection of points rising together to the right.
Distributions: The histograms give us a quick sense of the distribution of each variable,
such as whether it’s skewed, has a normal distribution, or contains outliers.
Data Density: The density of points in the scatter plots indicates the concentration of data
points and the variability within the dataset.
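A pair plot of this kind can be sketched with pandas' own scatter_matrix helper, used here as a stand-in for seaborn's pairplot. The data is randomly generated for illustration, and the column selection is an assumption.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic data (assumption): three numeric features for 100 applications.
rng = np.random.default_rng(1)
income = rng.uniform(50_000, 300_000, size=100)
credit = income * rng.uniform(1.0, 5.0, size=100)
numeric_features = pd.DataFrame(
    {
        "AMT_INCOME_TOTAL": income,
        "AMT_CREDIT_x": credit,
        "CNT_PAYMENT": rng.integers(6, 60, size=100),
    }
)

# Off-diagonal panels are pairwise scatter plots; the diagonal shows a
# histogram of each feature.
axes = scatter_matrix(numeric_features, figsize=(7, 7), diagonal="hist")
print(axes.shape)
```

The grid has one row and one column per feature, so its size grows quadratically with the number of columns; restricting the plot to a handful of features keeps it readable.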
Correlation Heatmap
Implications
The results and observations derived from the Bank Loan Default Risk Analysis project play a
crucial role in addressing the problem statement of identifying patterns that indicate a client's
difficulty in paying their loan installments.
Loan Status Distribution: By analyzing the distribution of loan statuses, lenders can gain insights
into the prevalence of different loan outcomes such as approvals, cancellations, refusals, and
unused offers. This information helps in understanding the overall health of the loan portfolio
and identifying potential risk areas.
Income Distribution by Loan Status: Understanding the distribution of income across different
loan statuses provides lenders with valuable information about the financial profiles of
borrowers. This analysis aids in assessing the risk associated with varying income levels and
making informed decisions regarding loan approvals.
Loan Amount Distribution by Loan Status: Examining the distribution of loan amounts by loan
status helps in evaluating the effectiveness of the bank's credit policies and identifying potential
anomalies or risks associated with loan amounts. This analysis guides strategic decisions related
to risk assessment and loan management.
Default Rates by Loan Purpose: Analyzing default rates across different loan purposes allows
lenders to identify high-risk loan categories and tailor their lending practices accordingly. By
understanding the variations in default rates based on loan purposes, lenders can mitigate risks
associated with specific borrower profiles or loan purposes.
Number of Clients with Credit-to-Income Ratio > 0.5: Identifying clients with a high credit-to-
income ratio highlights individuals who may be at a higher risk of default due to over-leveraging.
This information assists lenders in assessing the financial stability of borrowers and making risk-
informed decisions.
Default Rates by Loan Purpose (Bar Graph): Visualizing default rates by loan purpose provides a
clear overview of the proportion of approved, canceled, refused, and unused offers across
different loan categories. This visualization aids in identifying trends and patterns that can guide
decision-making processes related to loan approvals and risk management.
Credit to Income Ratio Distribution (Histogram): The histogram depicting the distribution of
credit-to-income ratios helps in assessing the debt burden of borrowers. Lenders can use this
information to evaluate the financial health of clients and determine their capacity to service debt
effectively.
Pair Plot of Numerical Features and Correlation Heatmap:
Correlations and Distributions: By examining the pair plot and correlation heatmap, lenders can
identify relationships between numerical features and understand the correlations between
variables like loan amount, income, and annuity. These insights provide a deeper understanding
of the factors influencing loan default risks and aid in developing more accurate risk assessment
models.
Overall, the results and observations obtained from the analysis of borrower
characteristics, loan statuses, categorical features, and numerical correlations provide lenders
with valuable insights to identify patterns indicative of potential loan repayment difficulties. By
leveraging these insights, lenders can make informed decisions regarding loan approvals, risk
assessment, and mitigation strategies, ultimately helping them manage their loan portfolios
effectively and minimize losses due to loan defaults.
CONCLUSION
In conclusion, the Bank Loan Default Risk Analysis project, based on the comprehensive
methodology of Univariate Analysis, Borrower Characteristics Analysis, and Categorical
Features Analysis, has provided valuable insights into identifying patterns indicative of clients'
difficulties in repaying their loan installments. The results and observations derived from the
analysis offer significant implications for addressing the problem statement of predicting loan
default risks effectively.
The Univariate Analysis established baseline default rates, crucial for assessing overall risk and
identifying potential areas of concern within the dataset. The Borrower Characteristics Analysis
delved into income distributions, loan amounts, and other borrower attributes to uncover
correlations with loan default rates, aiding in risk assessment and decision-making. The
Categorical Features Analysis highlighted variations in default rates across different borrower
profiles and loan purposes, enabling lenders to tailor their strategies and mitigate risks.
By visualizing loan status distributions, borrower characteristics, and categorical features, the
project has laid a solid foundation for lenders to make informed decisions, such as denying loans
to high-risk applicants, adjusting loan amounts, or offering loans at higher interest rates. The
methodology's holistic approach, insightful visualizations, and data-driven analysis have
equipped lenders with the necessary tools to manage loan portfolios effectively and minimize
potential losses due to loan defaults.