FRA Business Report
Surabhi Kulkarni
PGP-DSBA Online
TABLE OF CONTENTS
1. Problem Statement
2. Summary of Data
3. Outlier Treatments
6. MultiVariate Analysis
Fig – Outlier
Fig – Boxplot
Fig – heatmap
Fig – distplot
Fig – countplot
Fig – Scatterplot
Problem Statement
Businesses or companies can fall prey to default if they are not
able to keep up with their debt obligations. A default leads to a lower
credit rating for the company, which in turn reduces its chances of
getting credit in the future and may force it to pay higher interest
on existing debts as well as on any new obligations. From an
investor's point of view, a company is worth investing in if it
is capable of handling its financial obligations, can grow quickly,
and is able to manage growth at scale.
A balance sheet is a financial statement of a company that
provides a snapshot of what a company owns, owes, and the
amount invested by the shareholders. Thus, it is an important tool
that helps evaluate the performance of a business.
The available data includes information from the financial
statements of the companies for the previous year (2015). In addition,
information about the net worth of each company in the following
year (2016) is provided, which can be used to derive the labeled
field.
Importing Libraries.
Importing Data.
Observation-1:
The given data set has 3 integer-type features, 63 float-type
features, and 1 object-type feature.
Performing EDA
Target Variable –
Because this is financial data captured for small, medium, and large
companies, the outliers may well reflect genuine information rather
than errors.
Criteria:
1 - If the Net Worth Next Year is negative for the company
0 - If the Net Worth Next Year is positive for the company
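The labeling rule above can be sketched in pandas as follows. This is a minimal illustration on toy values, not the report's actual code: the column name `Networth Next Year` and the target name `default` are assumptions, and the report does not say how a net worth of exactly zero is handled (this sketch labels it 0).

```python
import pandas as pd

# Toy stand-in for the real data; the column name is an assumption.
df = pd.DataFrame({"Networth Next Year": [120.5, -30.0, 0.0, 45.2, -1.1]})

# 1 if next year's net worth is negative, otherwise 0.
# Assumption: a net worth of exactly zero is treated as non-default (0).
df["default"] = (df["Networth Next Year"] < 0).astype(int)

print(df["default"].tolist())  # [0, 1, 0, 0, 1]
```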
All the important features contributing to the model appear to have
a large number of outliers.
Most of the variables also take values in both the positive and the
negative range.
Univariate Analysis:
Boxplots were created for the numerical variables that are
important features in the dataset.
Distribution of column with Displot & Box plot:
Bivariate Analysis
As capital increases, net worth also increases, but in some cases
capital seems to be disbursed even at a lower net worth.
Networth Vs Cost of Production
Multi-variate Analysis:
We split the data set into df_1 (the independent variables) and
df_2 (the target variable).
We then split the data into training and testing sets in a 67:33
ratio, fit the model on the training set, and evaluate its
performance on both sets.
Seed value of 42 was used
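A split of this kind can be sketched with scikit-learn's `train_test_split`. The data here is a synthetic stand-in with hypothetical feature names `f1` to `f3`; only the 67:33 ratio and the seed of 42 come from the report. Whether the original split was stratified is not stated, so stratification is left out.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real df_1 / df_2 come from the report's dataset.
rng = np.random.default_rng(0)
df_1 = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
df_2 = pd.Series((rng.random(100) < 0.2).astype(int), name="default")

# 67:33 split with the seed stated in the report.
X_train, X_test, y_train, y_test = train_test_split(
    df_1, df_2, test_size=0.33, random_state=42
)

print(len(X_train), len(X_test))  # 67 33
```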
Q 1.6. : Build Logistic Regression Model (using statsmodel library) on
most important variables on Train Dataset and choose the optimum
cutoff. Also showcase your model building approach.
LogisticRegression(max_iter=10000, n_jobs=2,
penalty='none')
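The report does not state which criterion was used to pick the optimum cutoff. As one possible approach, the sketch below scans candidate thresholds over predicted probabilities and keeps the one that maximizes Youden's J (true positive rate minus false positive rate). The function name `youden_cutoff` and the toy labels and probabilities are hypothetical; in practice the probabilities would come from the fitted model on the training set.

```python
def youden_cutoff(y_true, y_prob, thresholds=None):
    """Return (cutoff, J) maximizing J = TPR - FPR over candidate thresholds."""
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 100)]
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_cutoff, best_j = None, float("-inf")
    for t in thresholds:
        # A row is predicted as default (1) when its probability is >= t.
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= t)
        j = tp / positives - fp / negatives
        if j > best_j:  # strict comparison keeps the lowest best threshold
            best_cutoff, best_j = t, j
    return best_cutoff, best_j


# Toy labels and predicted default probabilities (illustrative only).
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.02, 0.10, 0.15, 0.30, 0.35, 0.40, 0.70, 0.90]

cutoff, j = youden_cutoff(y_true, y_prob)
print(cutoff, round(j, 2))  # 0.31 0.8
```

Other criteria, such as maximizing F1 or fixing a minimum recall for the default class, would fit the same loop by swapping the score that is maximized.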
Q1.7. : Validate the Model on Test Dataset and
state the performance metrics. Also state
interpretation from the model.
0 1.00 0.00
1 0.97 0.03
2 0.99 0.01
We are plotting the confusion matrix and classification report for
both sets.
[[2165 26]
[ 86 125]]
Precision and accuracy are high, but recall appears to be lower on
the testing set.
[[1062 18]
[ 43 61]]
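The headline metrics can be recomputed directly from the two confusion matrices shown above. This sketch assumes the scikit-learn layout (rows are actual classes, columns are predicted classes, in the order 0 then 1) and assumes the first matrix is the training set and the second the testing set; the helper name `summarize` is hypothetical.

```python
def summarize(cm):
    """Accuracy, precision and recall for class 1 from a 2x2 confusion matrix.

    Assumes rows = actual, columns = predicted, class order [0, 1].
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


train_cm = [[2165, 26], [86, 125]]  # assumed to be the training set
test_cm = [[1062, 18], [43, 61]]    # assumed to be the testing set

train_metrics = summarize(train_cm)
test_metrics = summarize(test_cm)

for name, m in [("train", train_metrics), ("test", test_metrics)]:
    print(name, {k: round(v, 3) for k, v in m.items()})
# train {'accuracy': 0.953, 'precision': 0.828, 'recall': 0.592}
# test {'accuracy': 0.948, 'precision': 0.772, 'recall': 0.587}
```

Under these assumptions, accuracy and precision are close between the two sets, while recall for the default class stays near 0.59 on both.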
Finally, we are able to achieve a decent recall value without
overfitting. Considering challenges such as outliers, missing
values, and correlated features, this is a fairly good model. It can be
improved if we get better-quality data in which the features explaining
default are not missing to this extent. We could also try other
techniques that are less sensitive to missing values and outliers.