Trainity Data Analytics Training project 6

Data Analytics (Devi Ahilya Vishwavidyalaya)


Trainity Data Analytics Training


Project 6
Bank Loan Case Study
Date: 13-12-2023 Arpit Paliwal

Project Description: Conduct Exploratory Data Analysis (EDA) as a data analyst at a finance
company that specializes in lending to urban customers. The company faces the challenge of
customers with insufficient credit history exploiting the system and defaulting on loans. The goal
is to use EDA to find patterns in the data so that qualified applicants are not
rejected.

The dataset includes information on loan applications, categorized into customers with payment
difficulties (late payments on installments) and those without payment issues. Four possible
outcomes of a loan application are Approved, Canceled, Refused, and Unused Offer.

The business objectives are to identify patterns indicating if a customer will struggle with
installment payments. This information can be used to make decisions such as denying loans,
reducing loan amounts, or lending at higher interest rates to risky applicants. The company aims
to understand key factors behind loan defaults for better decision-making in loan approval.

The context of risk analytics in banking and financial services is crucial to understanding the
project, including the significance of various variables in predicting and mitigating loan default
risks.

Approach: The focus is on mitigating default risks, particularly from customers with insufficient
credit history. The dataset comprises two categories: customers with payment difficulties and
those without. Four possible loan application outcomes exist: Approved, Canceled, Refused,
and Unused Offer.

The primary business objectives are to identify patterns that signal potential payment difficulties
and to comprehend the key factors influencing loan defaults. Through EDA, the aim is to
optimize decision-making in loan approval by avoiding rejections for capable applicants while
mitigating financial losses associated with defaults. A foundational understanding of risk
analytics in banking and financial services is recommended to navigate the significance of
variables in this context.

Tech Stack Used: Microsoft Excel 2007 was the principal tool. The project relied on Excel's
built-in functions, data handling capabilities, and charting tools in both the analysis and
reporting phases.


Data Analysis Tasks:


A. Identify Missing Data and Deal with it Appropriately: As a data analyst, you come across
missing data in the loan application dataset. It is essential to handle missing data effectively to
ensure the accuracy of the analysis.
Task: Identify the missing data in the dataset and decide on an appropriate method to deal with
it using Excel built-in functions and features.
Primary Data Set: application_data.csv

Explanation: To handle missing data I did the following:

1. Calculated the percentage of blank cells for each column in a new row (50001) using a formula
of the form =(COUNTBLANK(range)/COUNT(range))*100. Note that COUNT counts only non-blank
numeric cells, so this ratio compares blanks to filled numeric cells; dividing by ROWS(range)
instead gives blanks as a percentage of all cells in the column.
2. With the help of conditional formatting, identified and deleted all columns in which more than
40% of the cells were missing.
3. Filled the remaining missing cells with the median of their column (computed in row 50002).
The median was used because the mean is distorted by outliers.

After cleaning, the data was left with 73 columns and 50002 rows, with 0 blank cells and no
duplicates.
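
For readers who want to reproduce the cleaning steps outside Excel, the following is a minimal Python sketch of the same idea, not the workflow used in this project. It assumes each column is a list of numbers with None marking a blank cell; the function name clean_missing, the column OWN_CAR_AGE, and all sample values are hypothetical.

```python
from statistics import median


def clean_missing(columns, max_missing_pct=40.0):
    """Drop mostly-blank columns, then fill remaining blanks with the median.

    columns: dict mapping column name -> list of numbers, None marks a blank.
    """
    cleaned = {}
    for name, values in columns.items():
        # Blanks as a percentage of ALL cells in the column.
        missing_pct = 100.0 * sum(v is None for v in values) / len(values)
        if missing_pct > max_missing_pct:
            continue  # column is dropped, as in step 2
        present = [v for v in values if v is not None]
        fill = median(present)  # median is robust to extreme values
        cleaned[name] = [fill if v is None else v for v in values]
    return cleaned


# Hypothetical sample data (not taken from application_data.csv).
sample = {
    "AMT_ANNUITY": [100.0, None, 300.0, 200.0, 10000.0],  # 20% blank, kept
    "OWN_CAR_AGE": [None, None, None, 4.0, 9.0],          # 60% blank, dropped
}
result = clean_missing(sample)
print(result)
```

In this sample the extreme value 10000.0 would pull a mean-based fill far upward, while the median fill stays close to the typical values, which is the reason given above for preferring the median.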

Graph:

Graph of column with missing values <40%

B. Identify Outliers in the Dataset: Outliers can significantly impact the analysis and distort the
results. You need to identify outliers in the loan application dataset.
Task: Detect and identify outliers in the dataset using Excel statistical functions and features,
focusing on numerical variables.
Primary Data Set: application_data.csv
Explanation: To identify the outliers, the quartile function was used as follows:
1. Calculated the first and third quartiles using the functions =QUARTILE(ARRAY,1) and
=QUARTILE(ARRAY,3).
2. Calculated the interquartile range (IQR) by subtracting the first quartile from the third quartile.
3. Calculated the lower and upper bounds using the formulas lower bound = Q1 - 1.5*IQR, upper
bound = Q3 + 1.5*IQR.
4. Created another column that checks whether each value lies between the lower bound and the
upper bound: TRUE if it does, FALSE if the value is an outlier.
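
The quartile procedure above can be sketched in Python as follows. This is an illustration under assumptions rather than the project's Excel sheet: the function names and the sample incomes are hypothetical, and the "inclusive" quantile method is chosen because it interpolates in the same way as Excel's QUARTILE. Unlike the Excel column described above, the flag here is True for an outlier.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds: Q1 - k*IQR and Q3 + k*IQR."""
    # method="inclusive" treats the data as the full population,
    # which corresponds to the interpolation used by Excel's QUARTILE.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return a list of booleans, True where the value is outside the bounds."""
    lower, upper = iqr_bounds(values, k)
    return [not (lower <= v <= upper) for v in values]


# Hypothetical income values; the last one is deliberately extreme.
incomes = [90000, 112500, 135000, 157500, 180000, 202500, 225000, 1170000]
lower, upper = iqr_bounds(incomes)
flags = flag_outliers(incomes)
print(lower, upper, flags)
```

The multiplier k is left as a parameter because 1.5 is a convention; a larger value flags only more extreme points.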


Graphs: The scatter plots here visualize the outliers (only 15000 rows were plotted because Excel
was freezing on a larger number of rows).


(Outlier identification was done on these crucial amount/income columns to find candidates
unfit for a loan.)
However, hardly any outliers were found in the given dataset.

C. Analyze Data Imbalance: Data imbalance can affect the accuracy of the analysis, especially
for binary classification problems. Understanding the data distribution is crucial for building
reliable models.
Task: Determine if there is data imbalance in the loan application dataset and calculate the ratio
of data imbalance using Excel functions.
Primary Data: application_data


Explanation: To check the data imbalance in the dataset I created different pivot tables for
columns in which the data imbalance was to be checked and generated column charts for each
of them in a different sheet.
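
The task also asks for the ratio of data imbalance. A minimal Python sketch of that calculation is shown below; it assumes the target column is coded 1 for clients with payment difficulties and 0 for all other cases, and the function name and sample counts are hypothetical rather than taken from the dataset.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return per-class counts and the majority-to-minority ratio."""
    counts = Counter(labels)
    majority = max(counts.values())
    minority = min(counts.values())
    return counts, majority / minority


# Hypothetical target column: 23 clients without difficulties, 2 with.
target = [0] * 23 + [1] * 2
counts, ratio = imbalance_ratio(target)
print(counts, ratio)
```

The same function works for any categorical column (contract type, gender, and so on), since it only counts distinct labels.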

Graphs:
Target:

Most applicants paid their installments on time; comparatively few had payment difficulties.

Contract Type:


Higher number of cash loans among clients than revolving loans.

Gender:

Significantly higher number of female applicants than male applicants.

Owning Realty:


Majority of the applicants are realty owners

Count of Children:


Most of the applicants have no children, which may suggest that young, career-focused applicants
are in the majority.

Organization Type:

Many of the applicants either have business entities or are self-employed.

D. Perform Univariate, Segmented Univariate, and Bivariate Analysis: To gain insights into the
driving factors of loan default, it is important to conduct various analyses on consumer and loan
attributes.
Task: Perform univariate analysis to understand the distribution of individual variables,
segmented univariate analysis to compare variable distributions for different scenarios, and
bivariate analysis to explore relationships between variables and the target variable using Excel
functions and features.
Primary Data: application_data
Secondary Data: previous_data

Analysis of application_data :
Univariate analysis: In this type of analysis data consists of only one variable. The analysis of
univariate data is thus the simplest form of analysis since the information deals with only one
quantity that changes. It does not deal with causes or relationships and the main purpose of the
analysis is to describe the data and find patterns that exist within it.

Segmented univariate analysis: Segmented univariate analysis is an extension of univariate
analysis in which the variable is analyzed in subsets (here, as ranges).


I generated the frequency distribution histograms by creating classes (from the max and min
values) and bins, then using Data > Data Analysis > Histogram with an input range, an output
range, and chart output.
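
The binning behind such a histogram can be sketched in Python as below. This is only an illustration: the function name and sample incomes are hypothetical, and the start value 25,650 and width 250,000 are assumptions chosen to match the first income range reported in this section. The sketch uses lower-inclusive bins, whereas Excel's Histogram tool treats each bin value as an inclusive upper limit, so counts can differ for values that fall exactly on a boundary.

```python
def frequency_table(values, start, width, n_bins):
    """Count values into equal-width bins [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        idx = int((v - start) // width)
        if 0 <= idx < n_bins:  # values outside the covered range are ignored
            counts[idx] += 1
    return [
        (start + i * width, start + (i + 1) * width, counts[i])
        for i in range(n_bins)
    ]


# Hypothetical incomes, binned from 25,650 in steps of 250,000.
incomes = [45000, 90000, 135000, 270000, 300000, 540000]
table = frequency_table(incomes, start=25650, width=250000, n_bins=3)
for low, high, count in table:
    print(f"{low:>9,} to {high:>9,}: {count}")
```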

The majority of applicants fall within the range of 25,650 to 275,650.

Loans are most prevalent in the lower credit range of 45,000 to 345,000, and as the credit
amount increases, the number of loans tends to decrease.


The majority of loans are obtained by individuals aged between 31 and 51. Across the wider age
bracket of 21 to 61, loan counts are relatively evenly distributed.
The number of individuals taking loans declines as the age range increases.

Individuals in the working category are the most frequent borrowers, with commercial
associates following closely behind.


People working for 0-10 years apply for most loans.

The highest number of loans is taken by married individuals, with singles coming in second.


The majority of individuals reside in apartments, with a comparatively lower number choosing to
live with their parents.

Bivariate analysis: Bivariate analysis is a statistical analysis in which two variables are
observed. One variable is dependent while the other is independent; they are usually denoted
by X and Y. The analysis examines how the two variables change together and to what extent.
Apart from bivariate analysis, there are two other types of statistical analysis:
Univariate (for one variable) and Multivariate (for multiple variables).

Amount income & target:


The largest number of defaults occurs among individuals with incomes in the range of 25,650
to 275,650. As income levels rise, both the number of loan applicants and the number of
defaults decrease.

Amount credit:

The majority of loans are concentrated within the low credit range of 45,000 to 345,000, with a
notable occurrence of defaults. The highest default rates are observed in the credit range of
345,000 to 645,000. Both the number of loans and the number of defaults decrease as credit
amounts increase.

Age/Target:

▪ The age range of 31 to 51 accounts for most loans and most defaults.
▪ Age groups between 21 and 61 have roughly similar counts,
indicating a balanced distribution.
▪ As the age range increases, the number of people taking loans decreases,
as does the number defaulting.

Count of children/target:


People with 0 children default most often, followed by people with 1 and 2 children (largely
because people with 0 children are the most numerous group).
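
Because the largest group naturally produces the most defaults, it can help to look at the default rate per group next to the raw counts. The following Python sketch shows one way to compute both; it is not part of the original Excel pivot tables, the function name and sample values are hypothetical, and the target is assumed to be coded 1 for a default and 0 otherwise.

```python
from collections import defaultdict


def default_summary(categories, targets):
    """Return {category: (applicants, defaults, default_rate)}."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for cat, t in zip(categories, targets):
        totals[cat] += 1
        defaults[cat] += t  # t is 1 for a default, 0 otherwise
    return {
        cat: (totals[cat], defaults[cat], defaults[cat] / totals[cat])
        for cat in totals
    }


# Hypothetical data: number of children and target flag per applicant.
children = [0] * 8 + [1] * 2
target = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
summary = default_summary(children, target)
for cat, (n, d, rate) in summary.items():
    print(f"children={cat}: applicants={n}, defaults={d}, rate={rate:.2f}")
```

In this made-up sample the group with 0 children has more defaults in absolute terms but a lower default rate, which is the distinction the parenthetical remark above points at.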

Family Status/target:


Majority of defaulters are married followed by singles.

previous_data univariate and bivariate analysis:

After cleaning the data by deleting duplicate rows, blank rows, and columns with more than
40% blank cells (as was done for application_data), the sheet is ready for analysis:
- 32 of the 37 columns remain after deleting the columns with mostly blank cells.

Univariate/Segmented Univariate analysis:
Amt_Goods_price:


The majority of applicants seek loans for goods falling within the range of 0 to 2 lakh, while
there is a decrease in the number of people applying for goods with higher amounts.
Client type:


Most of the applicants are repeaters.

Payment Type:

Most applicants prefer to pay in cash through the bank, followed by XNA.

Bivariate analysis:
Contract Status/loan purpose:


Most XAP loans are approved, while most XNA loans are rejected.


Contract Status/client Type:

Repeat applicants have the highest approval rate, with new applications following closely
behind. Repeat applicants face nearly equal chances of either being canceled or refused.

Contract Status/contract Type:


Consumer loans have a very low likelihood of being canceled. The highest proportion of cash
loans tends to be canceled.
Contract Status/Payment Type:

The majority of applicants have their loans approved through cash via the bank, and
cancellations are rare in this category. On the other hand, most applicants with the designation
XNA experience cancellations.

E. Identify Top Correlations for Different Scenarios: Understanding the correlation between
variables and the target variable can provide insights into strong indicators of loan default.
Task: Segment the dataset based on different scenarios (e.g., clients with payment difficulties
and all other cases) and identify the top correlations for each segmented data using Excel
functions.


Primary Dataset: application_data

Explanation: To calculate the correlations for the different scenarios (using numeric data only), I
copied the columns with numeric data to a different sheet and calculated their correlation matrix
using Data > Data Analysis > Correlation.

Correlation matrix:

1. There is a strong correlation of 0.880 between CNT_CHILDREN and CNT_FAM_MEMBERS:
the number of family members rises with the number of children.
2. AMT_CREDIT and AMT_GOODS_PRICE have a very high positive correlation of 0.987,
signifying a close relationship between them.
3. The positive correlation of 0.769 between AMT_ANNUITY and AMT_CREDIT suggests
that the annuity is linked to the loan amount.
4. AGE has a weak negative correlation of -0.242 with YEAR_EMPLOYED,
indicating that older individuals tend to have somewhat fewer years of employment.
5. REGION_RATING_CLIENT and REGION_RATING_CLIENT_W_CITY correlate negatively
(-0.532 and -0.530, respectively) with REGION_POPULATION_RELATIVE, indicating that
people living in more populated regions have numerically lower ratings of their living regions.
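
The task asks for the top correlations per segment (clients with payment difficulties versus all other cases). A minimal Python sketch of that segmentation is given below, assuming Pearson correlation, which is what Excel's CORREL and the Correlation tool compute. The helper names and the sample values are hypothetical; only the column names come from the dataset. A column that is constant within a segment would cause a division by zero and is not handled here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def segment(columns, target, value):
    """Keep only the rows whose target equals the given value."""
    keep = [i for i, t in enumerate(target) if t == value]
    return {name: [vals[i] for i in keep] for name, vals in columns.items()}


def top_correlations(columns, top_n=3):
    """Return the top_n column pairs ranked by absolute correlation."""
    names = list(columns)
    pairs = [
        (a, b, pearson(columns[a], columns[b]))
        for i, a in enumerate(names)
        for b in names[i + 1:]
    ]
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)[:top_n]


# Hypothetical values for three numeric columns and the target flag.
columns = {
    "AMT_CREDIT": [100, 200, 300, 400, 500, 600],
    "AMT_GOODS_PRICE": [90, 180, 270, 360, 450, 540],
    "CNT_CHILDREN": [0, 2, 1, 0, 1, 2],
}
target = [0, 0, 0, 1, 1, 1]

top_no_difficulty = top_correlations(segment(columns, target, 0), top_n=1)
top_difficulty = top_correlations(segment(columns, target, 1), top_n=1)
print(top_no_difficulty, top_difficulty)
```

Ranking by absolute value keeps strong negative correlations, such as the region rating pairs listed above, in the top list.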

Insights: Based on the analysis of the provided data, several observations can be made.
The majority of loan applicants have zero children, own real estate,
and own a business entity. Additionally, a significant portion of the applicants are
female, and the majority fall within the income range of 25,650 to 275,650. Furthermore,
individuals with a working income, 0-10 years of employment, and married status are
more likely to make timely payments on the loan.

Result: Engaging in this comprehensive project proved beneficial in gaining a deeper
understanding of various Excel functionalities. The exploration of concepts such as histograms,
correlation coefficients, and both univariate and bivariate analysis enhanced comprehension of
statistical principles. Handling a larger and more complex dataset contributed to an improved
approach to solving data analysis problems overall.


Drive Links:
Application_data:
https://docs.google.com/spreadsheets/d/1a2dkSdjpqA1yosCl-q_HBArZ80VCkmhc/edit?usp=drive_link&ouid=103303747027981242683&rtpof=true&sd=true
previous_data:
https://docs.google.com/spreadsheets/d/1PXPDoJynWRnDlQR2bpBmMwUNwVax_n9Y/edit?usp=drive_link&ouid=103303747027981242683&rtpof=true&sd=true
