
EasyChair Preprint № 12726

Credit EDA

Pavan Manikanta Bellam, Rohith Chinnamallappagari and Manisha Chandramaully

EasyChair preprints are intended for rapid dissemination of research results and are integrated with the rest of EasyChair.

March 27, 2024


CREDIT EDA

BELLAM PAVAN MANIKANTA
Department of Computer Science Engineering
Parul Institute of Engineering and Technology
Vadodara, India
[email protected]

CHINNAMALLAPPAGARI ROHITH
Department of Computer Science Engineering
Parul Institute of Engineering and Technology
Vadodara, India
[email protected]

MANISHA CHANDRAMAULLY
Assistant Professor, Department of Computer Science & Engineering,
Parul Institute of Engineering and Technology
[email protected]

Abstract— This research paper delves into the realm of credit analysis through an in-depth exploration of two distinct datasets related to client loan applications. The first dataset encompasses a comprehensive array of client information recorded at the time of loan application, while the second dataset provides insights into the client's historical interactions with the loan application process. Our methodology comprised separate analyses of each dataset, followed by a meticulous integration process aimed at facilitating a holistic examination of credit-related trends and patterns. By employing a diverse set of exploratory data analysis techniques, including descriptive statistics, data visualization, and correlation analysis, we unearthed intricate relationships within each dataset. This approach enabled us to gain a nuanced understanding of client creditworthiness and the factors influencing loan approval outcomes. Moreover, the amalgamation of findings from both datasets enriched our insights, revealing critical connections between application attributes and historical application outcomes. This paper contributes to the evolving landscape of credit analysis by emphasizing the importance of leveraging diverse datasets for a comprehensive understanding of client credit profiles and enhancing decision-making processes in the financial domain.

Keywords—Exploratory Data Analysis (EDA), credit-based datasets, credit analysis, loan approval, payment difficulties, client history, data integration, data visualization, creditworthiness, financial decision-making.

I. INTRODUCTION

Credit assessment lies at the heart of financial decision-making, influencing lending practices and risk management strategies. In this project, we delve into the realm of Exploratory Data Analysis (EDA) applied to credit-based datasets. These datasets encapsulate crucial information about clients' financial profiles and historical loan data, offering insights into payment difficulties and loan approval statuses. By combining and analyzing these datasets, we aim to unravel patterns, trends, and relationships that inform creditworthiness and risk assessment. Through meticulous preprocessing and analysis, our project endeavours to contribute to the understanding of credit analysis methodologies, empowering stakeholders with actionable insights for informed financial decision-making.

A. Problem Statement
This project aims to conduct Exploratory Data Analysis (EDA) on loan applicant data to identify patterns and mitigate financial risks for loan-providing companies. By analysing applicant profiles, the goal is to differentiate between individuals capable of repayment and potential defaulters, ensuring sound loan approval decisions and minimizing business losses.

B. Scope
The scope of this research project encompasses a comprehensive exploration of credit analysis using two primary datasets: 'application_data.csv' and 'previous_application.csv'. These datasets serve as the cornerstone for understanding client loan applications and their historical interactions with the loan application process. The project primarily employs exploratory data analysis (EDA) techniques, including descriptive statistics, data visualization, and correlation analysis, to unravel intricate patterns, trends, and relationships embedded within the datasets. A pivotal aspect of the project involves integrating the two datasets to facilitate a holistic examination of credit-related phenomena. Through this integration, the project aims to enrich insights by elucidating connections between application attributes and historical application outcomes. The overarching goal is to generate actionable insights into client creditworthiness and the factors influencing loan approval outcomes. While the project strives to offer valuable insights, it acknowledges inherent limitations such as data availability, time constraints, and computational resources. Despite these constraints, the project endeavours to provide a structured analysis of credit-based datasets, contributing to the ongoing discourse surrounding credit assessment practices and informing future research endeavours in the domain.

C. Aim and Objective
The aim of this project is to comprehensively analyze credit-related data to gain insights into client behaviors and loan outcomes. Through meticulous examination of payment behaviors and loan histories, the project seeks to identify patterns and trends within the data. By exploring relationships between client characteristics and loan outcomes, the research aims to contribute to the enhancement of credit assessment methodologies and decision-making processes in financial contexts. Through these objectives, the project endeavors to provide valuable insights that can inform and improve credit risk assessment practices, ultimately facilitating more informed financial decision-making.
II. LITERATURE SURVEY

Matthieu Komorowski [1] describes the most common tools available for exploring a dataset, which is essential in order to gain a good understanding of the features and potential issues of a dataset, as well as helping in hypothesis generation.

Jitendra Pramanik [2] emphasizes the importance of data analysis in making informed decisions, citing examples like recommendation systems and product purchase predictions. Exploratory Data Analysis (EDA) using Python, with libraries like pandas and matplotlib, was employed to interpret Amazon electronic item review datasets. Python's object-oriented, interpreted, and interactive nature facilitated comprehensive analysis of the data.

V.P. Sumathi [3] explores the surge in loan applications in India and the challenges banks face in predicting repayment capabilities. Utilizing exploratory data analysis, the paper reveals a preference for short-term loans, often sought for debt consolidation. Graphical representations assist bankers in comprehending client behavior for informed decision-making.

Sudhamathy G. [4] focuses on mitigating bank loan risks through data mining techniques. The paper proposes a decision tree-based model to assess loan default probabilities, utilizing pre-processed datasets for efficient predictions. Experimental results validate the model's efficacy in risk management strategies.

Rory M. Leith [5] utilizes exploratory data analysis to detect trends and statistical characteristics in nine streamflow time series, presenting results graphically and through relevant statistical tests. The approach not only identifies trends but also contextualizes responses against observed values, revealing periods of unusual flow conditions and non-normal behaviours in flow sequences.

Patricia Jimbo Santana [6] conducts a comparative analysis between optimization-based methods initialized with neural networks and partition algorithms based on trees for extracting credit risk rules. Results from real databases reveal that the former yields rules with reduced cardinality and acceptable classification precision, making it desirable for financial institutions making face-to-face credit approval decisions. This approach enables easier training of bank employees in selecting optimal customers, facilitating retail customer interactions.

III. METHODOLOGY

Exploratory Data Analysis (EDA) involves the initial steps in data analysis to understand the main characteristics of a dataset. It includes processes such as dataset and data exploration, data cleaning, and various visualization techniques like histograms, scatterplots, and boxplots to uncover patterns, trends, and relationships within the data. The following are the steps we followed to perform this project.

A. Dataset and Data Exploration
Dataset and data exploration involves examining the structure and content of a dataset before analysis. It includes tasks such as checking the dimensions of the dataset, identifying variable types, and understanding distributions of data through summary statistics and visualizations.

B. Data Cleaning
Data cleaning is the process of identifying and correcting errors, inconsistencies, or missing values in a dataset to improve its quality and reliability for analysis. It includes tasks such as removing duplicates, imputing missing values, and identifying and handling outliers.

C. Binning Values
Binning is a data preprocessing technique used to group continuous or categorical data into discrete intervals or bins. This simplifies the data and can help identify patterns or trends that may not be apparent when analyzing individual values. Binning can be useful for reducing the complexity of datasets, creating categorical variables from continuous data, or preparing data for analysis techniques that require discrete inputs.

D. Univariate Analysis
Univariate analysis focuses on analyzing and summarizing the characteristics of a single variable in a dataset. It involves examining the distribution of values, calculating descriptive statistics, and visualizing data with plots like histograms, distribution plots, and boxplots to understand a variable's behavior in isolation (a plotting sketch follows the list below).
• Histogram: A histogram is a graphical representation of the distribution of numerical data, divided into bins or intervals. It displays the frequency or count of observations falling within each bin, allowing for the visualization of the data distribution and the identification of patterns such as skewness or peaks.
• Distplot: A distplot, or distribution plot, is a graphical representation of the distribution of a univariate dataset. It combines a histogram with a kernel density estimate (KDE) plot, showing the frequency distribution of the data along with an estimated probability density function. Distplots help visualize the central tendency, spread, and shape of the distribution of data.
• Boxplot: A boxplot, or box-and-whisker plot, is a graphical summary of the distribution of numerical data through quartiles. It displays the median, quartiles, and range of the dataset, as well as identifying outliers. Boxplots are useful for comparing the distributions of different variables or groups within a dataset.
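To make these plot types concrete, the following is a minimal sketch of the three univariate plots using Matplotlib and Seaborn. The file name comes from Section V; the AMT_INCOME_TOTAL column is one of the application attributes analysed later, and sns.histplot(kde=True) stands in for the older distplot API.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the application dataset (path assumed to be the working directory).
df = pd.read_csv("application_data.csv")

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram: frequency of observations per bin.
axes[0].hist(df["AMT_INCOME_TOTAL"].dropna(), bins=50)
axes[0].set_title("Histogram")

# Distribution plot: histogram plus a KDE curve
# (sns.distplot is deprecated; histplot(kde=True) is the modern equivalent).
sns.histplot(df["AMT_INCOME_TOTAL"].dropna(), kde=True, ax=axes[1])
axes[1].set_title("Distplot")

# Boxplot: median, quartiles, range, and outliers.
sns.boxplot(x=df["AMT_INCOME_TOTAL"], ax=axes[2])
axes[2].set_title("Boxplot")

plt.tight_layout()
plt.show()
```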
E. Bivariate Analysis
Bivariate analysis involves analyzing the relationship between two variables in a dataset. It includes examining correlations, associations, or dependencies between variables using statistical measures and visualizations like scatterplots and pair plots to understand their interactions.
• Scatterplot: A scatterplot is a graphical representation of data points plotted on a Cartesian plane, with one variable on the x-axis and another on the y-axis. It shows the relationship between two variables, allowing for visual examination of patterns, trends, and correlations. Scatterplots are useful for identifying relationships and outliers in datasets.
• Pair plot: A pair plot is a grid of scatterplots and histograms that allows for the visualization of pairwise relationships between multiple variables in a dataset. It displays scatterplots for pairs of numerical variables and histograms of their distributions along the diagonal. Pair plots are useful for identifying patterns and correlations between variables.

F. Multivariate Analysis
Multivariate analysis involves exploring relationships among three or more variables simultaneously. It extends beyond bivariate analysis to uncover complex patterns and interactions within the data. Techniques such as heatmap visualization, multiple regression, principal component analysis (PCA), and cluster analysis are commonly used in multivariate analysis (a sketch of these plots follows this subsection).
• Heatmap: A heatmap is a graphical representation of data where values in a matrix are represented as colors. It is often used to visualize correlations or relationships between variables in a dataset, with higher values indicated by warmer colors (e.g., red) and lower values by cooler colors (e.g., blue). Heatmaps help identify patterns and trends in complex datasets.
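A minimal sketch of these bivariate and multivariate plots, assuming the same application data; the four AMT_* columns are the ones the analysis in Section V applies these plots to, and the row sample is only to keep the pair plot responsive.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("application_data.csv")  # path assumed
cols = ["AMT_INCOME_TOTAL", "AMT_CREDIT", "AMT_ANNUITY", "AMT_GOODS_PRICE"]

# Scatterplot: relationship between two numerical variables.
sns.scatterplot(data=df, x="AMT_INCOME_TOTAL", y="AMT_CREDIT")
plt.show()

# Pair plot: pairwise scatterplots, with distributions on the diagonal
# (a random sample keeps the plot responsive on 307511 rows).
sns.pairplot(df[cols].dropna().sample(1000, random_state=0))
plt.show()

# Heatmap: correlation matrix rendered as colors.
sns.heatmap(df[cols].corr(), annot=True, cmap="coolwarm")
plt.show()
```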
IV. EDA IN PYTHON

Python's ease of use, large library ecosystem, and strong data handling make it a popular choice for exploratory data analysis (EDA). Python is an open-source language that is available on several platforms and provides adaptability and compatibility with third-party tools. Its comprehensibility and readability enable developers to comprehend and alter code effectively. Python's abundance of libraries makes for smooth visualization, which helps produce reports that are both understandable and informative. With Python's robust capabilities and easy-to-use interface, analysts can quickly and easily gain insight into data through EDA tasks.

Pandas is a leading package for data analysis, renowned for its versatility and robust capabilities. With Pandas, users can efficiently clean, transform, and analyze datasets, facilitating seamless data manipulation tasks. It supports various data formats, including CSV, enabling easy storage and retrieval of data. Moreover, Pandas offers functionalities for data cleaning, visualization, and storage, streamlining the entire data analysis process. Leveraging the underlying power of the NumPy package, Pandas enhances data processing efficiency and performance. Additionally, Pandas seamlessly integrates with plotting functions from Matplotlib and Seaborn, enabling users to create insightful visualizations to further analyze and interpret data trends effectively.

Jupyter Notebook is a powerful tool for executing code in a cell-by-cell manner, offering a convenient console-based computing approach. Its web-based application allows for seamless interaction with code, facilitating input and output of computations in an intuitive manner. Additionally, Jupyter Notebook provides rich media representations of objects, enhancing the visual presentation of data analysis results.

V. WORKING ON THE DATASETS

It is time to investigate and learn more about the data. The data we are utilizing comes from credit-based datasets, where the primary files are 'application_data.csv', which contains all the information about the client at the time of application, including whether the client has payment difficulties, and 'previous_application.csv', which contains the client's previous loan data, namely whether each previous application was Approved, Cancelled, Refused or an Unused offer. We will examine the data and consider our alternatives.

1. First, we import all the necessary libraries: pandas, NumPy, Matplotlib and Seaborn.
2. Next, we import the first dataset, a CSV file named application_data.csv, which contains 307511 rows and 122 columns, as the data frame df.
3. We then use the head() method to get the top 5 rows of the data frame.
4. Next, we explore the dataset with the info() method, which provides information about the data frame, including the number of rows, column names, data types, and memory usage, and the describe() method, which computes summary statistics for numerical columns such as count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum.
5. Next, we check and handle null values. First, we drop columns with more than 47% null values and name the new data frame app_df, which retains only columns with a null percentage below 47%.
6. Next, we find the null percentage of each remaining column in app_df using isnull().mean()*100 in order to handle the columns that still contain null values.
7. Next, we fill the missing values with the mean, median, mode, or a new value, depending on the type of the column, using the fillna() method.
8. Next, to understand the unique values in a column of the dataset, along with their corresponding percentages in the distribution of column values, we use the `value_counts(normalize=True) * 100` method (a consolidated sketch of steps 1-8 follows).
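The following is a minimal sketch of steps 1-8. The file name, the 47% threshold, and the app_df name come from the steps above; the specific columns imputed in step 7 are illustrative choices, not a record of the exact calls used.

```python
# Step 1: the libraries named above (NumPy and the plotting
# libraries are used in the later sketches).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Steps 2-4: load the data and take a first look.
df = pd.read_csv("application_data.csv")
print(df.shape)       # (307511, 122)
print(df.head())      # top 5 rows
df.info()             # column names, dtypes, memory usage
print(df.describe())  # summary statistics for numerical columns

# Step 5: keep only columns with less than 47% null values.
null_pct = df.isnull().mean() * 100
app_df = df.loc[:, null_pct < 47].copy()

# Step 6: null percentage of the remaining columns.
print(app_df.isnull().mean() * 100)

# Step 7: fill missing values column by column (choices illustrative).
app_df["AMT_ANNUITY"] = app_df["AMT_ANNUITY"].fillna(app_df["AMT_ANNUITY"].median())
app_df["OCCUPATION_TYPE"] = app_df["OCCUPATION_TYPE"].fillna("Unknown")

# Step 8: unique values of a column expressed as percentages.
print(app_df["NAME_CONTRACT_TYPE"].value_counts(normalize=True) * 100)
```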
9. Next, we use boxplots to visualize the distribution of values in a column.
10. Following that, we introduce several new columns derived from binning certain columns within the app_df data frame. Initially, we utilize a lambda function to ensure that the data remains non-negative (absolute values). Subsequently, employing the cut() method, we categorize the column data into distinct intervals, with these categories serving as the values of the newly created columns.
11. Following the creation of the new columns, we utilize bar graphs and pie charts to visually represent each column's distribution. This approach allows us to gain insight into the proportion of data within each category.
12. Next, we execute app_df.TARGET.value_counts(normalize=True)*100 to calculate the percentage distribution of values in the TARGET column of the data frame app_df. It provides the normalized counts of each unique value in the TARGET column, expressed as percentages.
13. In the dataset there is a column named TARGET which has only two values, 0 and 1, where 1 indicates a client with payment difficulties (he/she had a late payment of more than X days on at least one of the first Y installments of the loan in our sample) and 0 indicates all other cases. We divide the app_df data frame into two data frames: tar_0, with all the data where the TARGET column has value 0, and tar_1, with all the data where the TARGET column has value 1 (a sketch of steps 10-13 follows).
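A minimal sketch of steps 10-13, continuing from the app_df frame built above. The TARGET split follows the steps verbatim; the DAYS_BIRTH column, the bin edges, and the AGE_CATEGORY labels are illustrative assumptions about which columns were binned.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Step 10: force values non-negative with a lambda, then bin with cut().
# DAYS_BIRTH (a negative day count in this dataset) is an assumed example.
app_df["YEARS_BIRTH"] = app_df["DAYS_BIRTH"].apply(lambda x: abs(x)) / 365
app_df["AGE_CATEGORY"] = pd.cut(
    app_df["YEARS_BIRTH"],
    bins=[0, 25, 35, 45, 55, 65, 100],
    labels=["0-25", "25-35", "35-45", "45-55", "55-65", "65+"],
)

# Step 11: bar graph and pie chart of a binned column's distribution.
app_df["AGE_CATEGORY"].value_counts().plot(kind="bar")
plt.show()
app_df["AGE_CATEGORY"].value_counts().plot(kind="pie", autopct="%1.1f%%")
plt.show()

# Step 12: percentage distribution of the TARGET column.
print(app_df.TARGET.value_counts(normalize=True) * 100)

# Step 13: split by payment difficulties (1) versus all other cases (0).
tar_0 = app_df[app_df.TARGET == 0]
tar_1 = app_df[app_df.TARGET == 1]
```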
14. Next, we perform univariate analysis by first dividing the columns into categorical and numerical columns. For the categorical columns, we use the value_counts(normalize=True) method to understand the unique values in each column, along with their corresponding percentages in the distribution of column values, and plot them with pie charts. For the numerical columns, we use the describe() method, which computes summary statistics such as count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum, and plot them using boxplots to understand the distribution of values in each column. We also plot distplots to get a visual comparison of the distribution of a column between the two groups, tar_0 and tar_1, representing clients without and with payment difficulties, respectively.
15. Next, we conduct bivariate analysis on two sets of variables: WEEKDAY_APPR_PROCESS_START and HOUR_APPR_PROCESS_START for the tar_0 and tar_1 groups, and AGE_CATEGORY and AMT_CREDIT for the same groups. The first set of boxplots compares the timing of loan applications across weekdays for both groups, while the second set compares the amount of credit granted across different age categories.
16. Next, we plot a pairplot on AMT_INCOME_TOTAL, AMT_CREDIT, AMT_ANNUITY, and AMT_GOODS_PRICE to understand the relationships among these variables.
17. Next, we perform multivariate analysis on AMT_INCOME_TOTAL, AMT_CREDIT, AMT_ANNUITY, AMT_GOODS_PRICE, YEARS_BIRTH, YEARS_EMPLOYED, YEARS_REGISTRATION, YEARS_ID_PUBLISH, and YEARS_LAST_PHONE_CHANGE by generating a correlation matrix using the corr() method and plotting a heatmap of the correlation matrix.
18. Next, we perform all the above steps on previous_application.csv to analyse that dataset in the same way, importing the data file as the papp_df data frame.
19. Next, we merge both data frames on the common column SK_ID_CURR using the merge() method, and use the head() and info() methods and the shape attribute to learn about the merged data frame.
20. Finally, we build pivot tables on the required columns using the pivot_table() method and plot each pivot table as a heatmap (a sketch of steps 14-20 follows).
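A minimal sketch of steps 14-20, continuing from the tar_0/tar_1 split. The column lists come from the steps above; the pivot-table index, columns, and aggregated value are illustrative guesses rather than the exact configuration used.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Step 14: distplot-style comparison of one column across the two groups.
sns.histplot(tar_0["AMT_CREDIT"], kde=True, color="green", label="TARGET 0")
sns.histplot(tar_1["AMT_CREDIT"], kde=True, color="red", label="TARGET 1")
plt.legend()
plt.show()

# Step 15: bivariate boxplots (shown for tar_1; repeated for tar_0).
sns.boxplot(data=tar_1, x="WEEKDAY_APPR_PROCESS_START",
            y="HOUR_APPR_PROCESS_START")
plt.show()

# Steps 16-17: pairplot and correlation heatmap (assumes the YEARS_*
# columns were derived from the DAYS_* columns as in step 10).
amt_cols = ["AMT_INCOME_TOTAL", "AMT_CREDIT", "AMT_ANNUITY", "AMT_GOODS_PRICE"]
year_cols = ["YEARS_BIRTH", "YEARS_EMPLOYED", "YEARS_REGISTRATION",
             "YEARS_ID_PUBLISH", "YEARS_LAST_PHONE_CHANGE"]
sns.pairplot(app_df[amt_cols].dropna().sample(1000, random_state=0))
plt.show()
sns.heatmap(app_df[amt_cols + year_cols].corr(), annot=True, cmap="coolwarm")
plt.show()

# Steps 18-19: repeat the pipeline on previous_application.csv, then merge.
papp_df = pd.read_csv("previous_application.csv")
merged_df = app_df.merge(papp_df, on="SK_ID_CURR", how="inner")
print(merged_df.shape)
merged_df.info()

# Step 20: pivot table rendered as a heatmap; the index, columns, and
# aggregated value here are illustrative assumptions.
pivot = merged_df.pivot_table(index="NAME_INCOME_TYPE",
                              columns="NAME_CONTRACT_STATUS",
                              values="TARGET", aggfunc="mean")
sns.heatmap(pivot, annot=True, cmap="Blues")
plt.show()
```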
VI. RESULT

Here are some of the results and conclusions that we made from the analysis.

Based on the graphs of loan application timings (step 15), the conclusions we have made are:
• The bank operates between 10 am and 3 pm, except on Saturday and Sunday, when it operates between 10 am and 2 pm.
• We can observe that around 50% of customers visit the branch for a loan application between 11:30 am and 12 pm on all days, except for Saturday, where the peak is between 10 am and 11 am, for both TARGET 0 and TARGET 1.
• The loan defaulters applied for their loans between 9:30 am-10 am and around 2 pm, whereas the applicants who repay the loan on time applied between 10 am and 3 pm.

Based on the pivot-table heatmaps of loan status (step 20), the conclusions we have made are:
1. Applicants with income type Maternity leave and client type New have a higher chance of getting the loan approved.
2. Applicants with income types Maternity leave and Unemployed and client type Repeater are more likely to have the loan cancelled.
3. Applicants with income types Maternity leave and Unemployed and client type Repeater are more likely to have the loan refused.
4. Applicants with income type Maternity leave and client type Repeater, and income type Working and client type New, are not able to utilize the bank's offer.
Based on the correlation heatmap (step 17), the conclusions we have made are:
1. AMT_INCOME_TOTAL: it is only weakly correlated with AMT_CREDIT, AMT_ANNUITY, and AMT_GOODS_PRICE.
2. AMT_CREDIT: it has strong positive correlations of 0.98 and 0.75 with AMT_GOODS_PRICE and AMT_ANNUITY, respectively, and is also positively correlated with the year columns.
3. AMT_ANNUITY: it has a positive correlation of 0.75 with AMT_CREDIT and AMT_GOODS_PRICE, and negative correlations with YEARS_EMPLOYED and YEARS_REGISTRATION.
4. AMT_GOODS_PRICE: it has a strong positive correlation of 0.98 with AMT_ANNUITY and AMT_CREDIT, and weak positive correlations with the other year columns.

VII. CONCLUSION

In this study, we utilized Exploratory Data Analysis (EDA) methods to decipher critical insights within credit datasets. Conducted within the Jupyter Notebook environment using Python, alongside essential libraries such as NumPy, Pandas, Matplotlib, and Seaborn, our analysis delved into the intricacies of client financial data. Looking ahead, we plan to expand our investigation by incorporating additional datasets and leveraging advanced analytical techniques to gain deeper insights into exploratory data analysis methodologies.

REFERENCES
[1] Matthieu Komorowski, Dominic C. Marshall, Justin D. Salciccioli and Yves Crutain, 2016, Exploratory Data Analysis
[2] Kabita Sahoo, Abhaya Kumar Samal, Jitendra Pramanik, Subhendu Kumar Pani, 2019, Exploratory Data Analysis using Python
[3] X. Francis Jency, V. P. Sumathi, Janani Shiva Sri, 2018, An Exploratory Data Analysis for Loan Prediction Based on Nature of the Clients
[4] Sudhamathy G., 2016, Credit Risk Analysis and Prediction Modelling of Bank Loans Using R
[5] Rory M. Leith, Keith W. Hipel & Herman Goertz, 2013, Exploratory Data Analysis
[6] Patricia Jimbo Santana, Augusto Villa Monte, Enzo Rucci, Laura Lanzarini, Aurelio F. Bariviera, 2016, An exploratory analysis of methods for extracting credit risk rules
[7] Exploratory data analysis – Wikipedia, the free encyclopedia [Online]. Available: https://en.wikipedia.org/wiki/Exploratory_data_analysis
[8] Exploratory Data Analysis in Python – GeeksforGeeks [Online]. Available: https://www.geeksforgeeks.org/exploratory-data-analysis-in-python/
