Trainity-Data An
Project Description: Conduct Exploratory Data Analysis (EDA) as a data analyst at a finance
company specializing in lending loans to urban customers. The company faces a challenge of
customers with insufficient credit history exploiting the system and defaulting on loans. The goal
is to use EDA to analyze patterns in the data and ensure that qualified applicants are not
rejected.
The dataset includes information on loan applications, categorized into customers with payment
difficulties (late payments on installments) and those without payment issues. Four possible
outcomes of a loan application are Approved, Canceled, Refused, and Unused Offer.
The business objectives are to identify patterns indicating if a customer will struggle with
installment payments. This information can be used to make decisions such as denying loans,
reducing loan amounts, or lending at higher interest rates to risky applicants. The company aims
to understand key factors behind loan defaults for better decision-making in loan approval.
The context of risk analytics in banking and financial services is crucial to understanding the
project, including the significance of various variables in predicting and mitigating loan default
risks.
Approach: The focus is on mitigating default risks, particularly from customers with insufficient
credit history. The dataset comprises two categories: customers with payment difficulties and
those without. Four possible loan application outcomes exist: Approved, Canceled, Refused,
and Unused Offer.
The primary business objectives are to identify patterns that signal potential payment difficulties
and to comprehend the key factors influencing loan defaults. Through EDA, the aim is to
optimize decision-making in loan approval by avoiding rejections for capable applicants while
mitigating financial losses associated with defaults. A foundational understanding of risk
analytics in banking and financial services is recommended to navigate the significance of
variables in this context.
Tech Stack Used: Microsoft Excel 2007 was the principal tool. The project relied on Excel's
built-in functions, data handling capabilities, and charting tools in both the analysis and
reporting phases. Excel's familiar interface made it straightforward to manipulate the data and
generate the reports used to evaluate it.
After cleaning, the dataset had 73 columns and 50002 rows, with 0 blank cells and no
duplicates.
Graph:
B. Identify Outliers in the Dataset: Outliers can significantly impact the analysis and distort the
results. You need to identify outliers in the loan application dataset.
Task: Detect and identify outliers in the dataset using Excel statistical functions and features,
focusing on numerical variables.
Primary Data Set: application_data.csv
Explanation: To identify the outliers, the QUARTILE function was used as follows:
1. Calculated the first and third quartiles using =QUARTILE(ARRAY,1) and
=QUARTILE(ARRAY,3).
2. Calculated the interquartile range (IQR) by subtracting the first quartile from the third quartile.
3. Calculated the lower and upper bounds using lower bound = Q1 - 1.5*IQR and upper
bound = Q3 + 1.5*IQR.
4. Created another column to check whether each value lies between the lower bound and the
upper bound: TRUE if it does, FALSE if the value is an outlier.
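The four steps above can be sketched outside Excel as follows. This is a minimal Python illustration, not the workbook's actual formulas: the income values are made up, and the function names `iqr_bounds` and `within_bounds` are hypothetical. The "inclusive" quantile method is used on the assumption that it interpolates the same way as Excel's QUARTILE.

```python
from statistics import quantiles

def iqr_bounds(values):
    """Return (lower bound, upper bound) computed from Q1, Q3 and the IQR."""
    # method="inclusive" interpolates between data points, assumed to match =QUARTILE(ARRAY, n)
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def within_bounds(values):
    """Mirror the helper column: True if the value is in range, False if it is an outlier."""
    lower, upper = iqr_bounds(values)
    return [lower <= v <= upper for v in values]

# Hypothetical income values, not taken from application_data.csv
incomes = [90000, 112500, 135000, 157500, 180000, 202500, 225000, 1170000]

lower, upper = iqr_bounds(incomes)
flags = within_bounds(incomes)
outliers = [v for v, ok in zip(incomes, flags) if not ok]
print(lower, upper, outliers)
```

With these sample values only the last income falls outside the bounds.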
Graphs: The scatter plots visualize the outliers (only 15000 rows were plotted because Excel was
freezing on a larger number of rows).
(Outliers were identified on these crucial amount/income columns to find candidates unfit
for a loan.)
However, hardly any outliers were found in the given dataset.
C. Analyze Data Imbalance: Data imbalance can affect the accuracy of the analysis, especially
for binary classification problems. Understanding the data distribution is crucial for building
reliable models.
Task: Determine if there is data imbalance in the loan application dataset and calculate the ratio
of data imbalance using Excel functions.
Primary Data: application_data
Explanation: To check for data imbalance, I created a separate pivot table for each column in
which imbalance was to be checked and generated a column chart for each of them on a
different sheet.
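The pivot-table count and the imbalance ratio the task asks for can be illustrated with a short Python sketch. The label values below are invented for the example, and `imbalance_ratio` is a hypothetical helper name; the ratio is taken here as majority count divided by minority count, which is one common convention rather than the only one.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Count each category and return (counts, majority count / minority count)."""
    counts = Counter(labels)
    majority = max(counts.values())
    minority = min(counts.values())
    return counts, majority / minority

# Hypothetical TARGET column: 0 = no payment difficulties, 1 = payment difficulties
target = [0] * 23 + [1] * 2

counts, ratio = imbalance_ratio(target)
print(dict(counts), ratio)
```

The same function works for any categorical column, such as contract type or gender, because it only counts distinct values.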
Graphs:
Target:
Most applicants paid their installments on time; comparatively few had difficulties.
Contract Type:
Gender:
Owning Realty:
Count of Children:
Most of the applicants are childless, which suggests that young, career-focused applicants are
in the majority.
Organization Type:
Many of the applicants either have business entities or are self-employed.
D. Perform Univariate, Segmented Univariate, and Bivariate Analysis: To gain insights into the
driving factors of loan default, it is important to conduct various analyses on consumer and loan
attributes.
Task: Perform univariate analysis to understand the distribution of individual variables,
segmented univariate analysis to compare variable distributions for different scenarios, and
bivariate analysis to explore relationships between variables and the target variable using Excel
functions and features.
Primary Data: application_data
Secondary Data: previous_data
Analysis of application_data :
Univariate analysis: In this type of analysis data consists of only one variable. The analysis of
univariate data is thus the simplest form of analysis since the information deals with only one
quantity that changes. It does not deal with causes or relationships and the main purpose of the
analysis is to describe the data and find patterns that exist within it.
I generated frequency distribution histograms by creating classes (from the max and min values)
and bins, then using Data Analysis > Histogram with the input range, output range, and chart
output options.
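The binning behind such a histogram can be sketched in Python as below. The credit amounts are hypothetical, `histogram` is an illustrative helper, and the bin width of 300,000 starting at 45,000 is an assumption chosen to reproduce ranges like 45,000 to 345,000. Note one difference: this sketch puts a value that sits exactly on a boundary into the higher bin, whereas Excel's Histogram tool counts values up to and including each bin value.

```python
def histogram(values, bin_width, start=None):
    """Count values per bin; each key is the lower edge of a bin of width bin_width."""
    start = min(values) if start is None else start
    counts = {}
    for v in values:
        index = (v - start) // bin_width      # which bin the value falls into
        lower_edge = start + index * bin_width
        counts[lower_edge] = counts.get(lower_edge, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical credit amounts, not taken from application_data.csv
credit = [45000, 120000, 300000, 344999, 345000, 500000, 900000]

bins = histogram(credit, bin_width=300000, start=45000)
for lower_edge, count in bins.items():
    print(f"{lower_edge} to {lower_edge + 300000}: {count}")
```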
Loans are most prevalent in the lower credit range of 45,000 to 345,000, and the number of
loans tends to decrease as the credit amount increases.
The majority of loans are obtained by individuals aged between 31 and 51, while loan counts
across the wider age bracket of 21 to 61 are relatively evenly distributed.
The number of individuals taking loans declines in the higher age ranges.
Individuals in the working category are the most frequent borrowers, with commercial
associates following closely behind.
The highest number of loans is taken by married individuals, with singles coming in second.
The majority of individuals reside in apartments, with a comparatively lower number choosing to
live with their parents.
Bivariate analysis: Bivariate analysis is a statistical analysis in which two variables are
observed together. One variable is dependent and the other is independent; they are
usually denoted by X and Y. The aim is to analyze how the two variables change together and
to what extent. The other two types of statistical analysis are univariate (one variable) and
multivariate (multiple variables).
The highest incidence of defaults occurs among individuals with incomes in the range of 25,650
to 275,650. As income levels rise, both the number of loan applicants and the instances of
defaults decrease.
Amount credit:
The majority of loans are concentrated within the low credit range of 45,000 to 345,000, with a
notable occurrence of defaults. The highest default rates are observed in the credit range of
345,000 to 645,000. Both the number of loans and the number of defaults decrease as credit
amounts increase.
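The difference between the number of defaults in a range and the default rate of that range can be made concrete with a small Python sketch. The amounts and target values are invented for illustration, and `default_rate_by_bin` is a hypothetical helper; the bin start of 45,000 and width of 300,000 are assumptions matching the ranges quoted above.

```python
from collections import defaultdict

def default_rate_by_bin(amounts, targets, bin_width, start):
    """Return {bin lower edge: (defaults, applicants, default rate)}."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for amount, target in zip(amounts, targets):
        lower_edge = start + ((amount - start) // bin_width) * bin_width
        totals[lower_edge] += 1
        defaults[lower_edge] += target        # target is 1 for payment difficulties, else 0
    return {
        edge: (defaults[edge], totals[edge], defaults[edge] / totals[edge])
        for edge in sorted(totals)
    }

# Hypothetical credit amounts and TARGET flags
amounts = [50000, 100000, 200000, 300000, 400000, 600000]
targets = [0, 0, 1, 0, 1, 0]

rates = default_rate_by_bin(amounts, targets, bin_width=300000, start=45000)
print(rates)
```

In this made-up sample both bins contain one default, but the second bin has the higher rate because it has fewer applicants, which is why a range can show the highest rate without holding the most loans.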
Age/Target:
Count of children/target:
Family Status/target:
The majority of applicants seek loans for goods falling within the range of 0 to 2 lakh, while
there is a decrease in the number of people applying for goods with higher amounts.
Client type:
Payment Type:
Most applicants prefer cash through the bank, followed by XNA.
Bivariate analysis:
Contract Status/loan purpose:
Most XAP loans are approved, while most XNA loans are rejected.
Repeat applicants have the highest approval rate, with new applications following closely
behind. Repeat applicants face nearly equal chances of either being canceled or refused.
Consumer loans have a very low likelihood of being canceled. The highest proportion of cash
loans tends to be canceled.
Contract Status/Contract Type:
The majority of applicants have their loans approved through cash via the bank, and
cancellations are rare in this category. On the other hand, most applicants with the designation
XNA experience cancellations.
E. Identify Top Correlations for Different Scenarios: Understanding the correlation between
variables and the target variable can provide insights into strong indicators of loan default.
Task: Segment the dataset based on different scenarios (e.g., clients with payment difficulties
and all other cases) and identify the top correlations for each segmented data using Excel
functions.
Explanation: To calculate the correlations for the different scenarios (using numeric data only), I
copied the columns with numeric data to a different sheet and calculated their correlation matrix
using Data > Data Analysis > Correlation.
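A rough Python equivalent of segmenting by the target and ranking pairwise correlations is sketched below. The rows are invented, the column names AMT_CREDIT, AMT_ANNUITY and AMT_INCOME_TOTAL are assumptions about the dataset's headers, and `pearson` and `top_correlations` are illustrative helpers rather than the Excel procedure itself.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def top_correlations(rows, columns, target_value, k=3):
    """Keep rows of one TARGET segment and rank column pairs by absolute correlation."""
    segment = [row for row in rows if row["TARGET"] == target_value]
    results = []
    for a, b in combinations(columns, 2):
        r = pearson([row[a] for row in segment], [row[b] for row in segment])
        results.append((a, b, r))
    results.sort(key=lambda item: abs(item[2]), reverse=True)
    return results[:k]

# Hypothetical rows; column names are assumed, values are made up
rows = [
    {"TARGET": 0, "AMT_CREDIT": 100, "AMT_ANNUITY": 10, "AMT_INCOME_TOTAL": 50},
    {"TARGET": 0, "AMT_CREDIT": 200, "AMT_ANNUITY": 20, "AMT_INCOME_TOTAL": 40},
    {"TARGET": 0, "AMT_CREDIT": 300, "AMT_ANNUITY": 30, "AMT_INCOME_TOTAL": 60},
    {"TARGET": 0, "AMT_CREDIT": 400, "AMT_ANNUITY": 40, "AMT_INCOME_TOTAL": 45},
    {"TARGET": 1, "AMT_CREDIT": 150, "AMT_ANNUITY": 12, "AMT_INCOME_TOTAL": 30},
    {"TARGET": 1, "AMT_CREDIT": 250, "AMT_ANNUITY": 30, "AMT_INCOME_TOTAL": 35},
    {"TARGET": 1, "AMT_CREDIT": 350, "AMT_ANNUITY": 33, "AMT_INCOME_TOTAL": 20},
]
numeric_columns = ["AMT_CREDIT", "AMT_ANNUITY", "AMT_INCOME_TOTAL"]

top_no_difficulty = top_correlations(rows, numeric_columns, target_value=0)
top_difficulty = top_correlations(rows, numeric_columns, target_value=1)
print(top_no_difficulty[0], top_difficulty[0])
```

Running the ranking once per segment makes it possible to compare which variable pairs move together for clients with payment difficulties and for all other cases.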
Correlation matrix:
Insights: Based on the analysis of the provided data, several observations can be made. The
majority of loan applicants have no children, own real estate, and own a business entity.
A significant portion of the applicants are female, and most fall within the income range of
25,650 to 275,650. Furthermore, individuals with a working income, 0-10 years of employment,
and married status are more likely to make timely loan payments.
Drive Links:
Application_data:
https://docs.google.com/spreadsheets/d/1a2dkSdjpqA1yosCl-q_HBArZ80VCkmhc/edit?usp=drive_link&ouid=103303747027981242683&rtpof=true&sd=true
previous_data:
https://docs.google.com/spreadsheets/d/1PXPDoJynWRnDlQR2bpBmMwUNwVax_n9Y/edit?usp=drive_link&ouid=103303747027981242683&rtpof=true&sd=true