FRA Business Report

The document summarizes the analysis of a dataset containing financial information of companies from 2015-2016 to predict whether a company would default. Key steps included: outlier treatment, imputation of missing values, transforming the target variable to binary, feature selection and correlation analysis. A logistic regression model was built on important variables using recursive feature elimination. The model achieved 95% accuracy on both training and test sets, though recall was lower, indicating some defaulters may be missed. Overall the model performed reasonably well given data quality issues.

Uploaded by

Surabhi Kulkarni


FRA Milestone-1 - Report

Surabhi Kulkarni

PGP-DSBA Online

TABLE OF CONTENTS

1. Problem Statement

2. Summary of Data

3. Outlier Treatments

4. Missing Value Treatment

5. Transform Target Variable to 0 and 1

6. MultiVariate Analysis

7. Train Test Split

8. Logistic Regression Models

9. Performance Metrics of All Models & Interpretations

List of Figures

Fig – Outlier
Fig – Boxplot
Fig – heatmap
Fig – distplot
Fig – countplot
Fig –Scatterplot

Problem Statement
Businesses or companies can fall prey to default if they are not
able to keep up with their debt obligations. A default leads to a lower
credit rating for the company, which in turn reduces its chances of
getting credit in the future and may force it to pay higher interest
on existing debts as well as on any new obligations. From an
investor's point of view, a company is worth investing in if it
is capable of handling its financial obligations, can grow quickly,
and is able to manage the growth scale.
A balance sheet is a financial statement of a company that
provides a snapshot of what a company owns, owes, and the
amount invested by the shareholders. Thus, it is an important tool
that helps evaluate the performance of a business.
The available data includes information from the financial
statements of the companies for the previous year (2015). Also,
information about the Networth of the company in the following
year (2016) is provided, which can be used to derive the labeled
field.
Importing Libraries.
Importing Data.

Checking the type of the dataset.

Checking the shape of the dataset: (3586, 67)

Getting the info data types column wise.

dtypes: float64(63), int64(3), object(1)
memory usage: 1.8+ MB

Observation-1:

The data set contains 3586 rows and 67 columns.

In the given data set there are 3 integer-type features, 63 float-type
features and 1 object-type feature.

Performing EDA

EDA-Step 1: Checking for duplicate records in the data

Number of duplicate rows = 0
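
The checks listed above (type, shape, column data types, duplicate rows) can be sketched as follows. This is a minimal illustration on a small made-up DataFrame; the toy values and the `Co_Name` column are assumptions, since the report does not show the source file or the loading code.

```python
import pandas as pd

# Hypothetical stand-in for the company data (the real file is not shown in the report).
df = pd.DataFrame({
    "Co_Code": [1, 2, 3, 4],
    "Co_Name": ["A", "B", "C", "D"],          # assumed name for the single object column
    "Networth_Next_Year": [10.5, -3.2, 0.0, 7.1],
    "Networth": [9.0, -1.0, 0.5, 6.0],
})

print(type(df))                    # confirms the object is a pandas DataFrame
print(df.shape)                    # (rows, columns)
print(df.dtypes.value_counts())    # number of columns per data type

# duplicated() flags rows that repeat an earlier row exactly
n_duplicates = int(df.duplicated().sum())
print("Number of duplicate rows =", n_duplicates)
```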

Target Variable –

- We create a target variable - ‘default’

- Where, if Net-worth next year is zero or positive —> default = 0

- If Net-worth next year is negative —> default = 1

Co_Code 291
Networth_Next_Year 676
Equity_Paid_Up 448
Networth 650
Capital_Employed 596
...
Creditors_Velocity_Days 391
Inventory_Velocity_Days 262
Value_of_OutputtoTotal_Assets 150
Value_of_OutputtoGross_Block 481
default 388
Length: 67, dtype: int64

Number of missing values after replacing outliers with NaN values is 42828.

The data set contains 3586 rows and 67 columns.

Because this is financial data captured for small, medium as well as large
companies, the outliers might very well reflect information which is
genuine in nature.
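
Replacing outliers with NaN values can be sketched as below. The report does not state which rule was used to flag outliers, so this sketch assumes the common 1.5 × IQR rule; the helper name `mask_outliers_iqr` and the toy values are illustrative only.

```python
import pandas as pd

def mask_outliers_iqr(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Return a copy where numeric values outside [Q1 - k*IQR, Q3 + k*IQR] become NaN."""
    out = frame.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    q1 = out[numeric_cols].quantile(0.25)
    q3 = out[numeric_cols].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Compare every column against its own bounds (Series aligned on column names)
    is_outlier = out[numeric_cols].lt(lower, axis=1) | out[numeric_cols].gt(upper, axis=1)
    out[numeric_cols] = out[numeric_cols].mask(is_outlier)
    return out

# Toy data: only the value 500.0 lies outside the IQR fences
toy = pd.DataFrame({
    "Networth": [1.0, 2.0, 3.0, 4.0, 500.0],
    "Equity_Paid_Up": [10.0, 11.0, 12.0, 13.0, 14.0],
})
masked = mask_outliers_iqr(toy)
n_missing = int(masked.isna().sum().sum())
print("Missing values after masking outliers:", n_missing)
```

Since genuine extreme values are turned into missing values by this step, the choice of `k` directly controls how much data is handed over to the imputation stage.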

1.2 Missing Value Treatment

Visualizing Missing Values:

The presence of missing values in some variables can be observed. Blue
color in the heatmap indicates occupied cells, while red color
indicates missing values present in the data. Listing down a few
observations:

No more missing values were present after treatment.
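
The imputation method is not named in this report. As one possible approach, and purely as an assumption, the sketch below fills missing numeric values with each column's median and then confirms that nothing is left missing. The toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column
toy = pd.DataFrame({
    "Networth": [1.0, np.nan, 3.0, 5.0],
    "Capital_Employed": [10.0, 20.0, np.nan, 40.0],
})

numeric_cols = toy.select_dtypes(include="number").columns
imputed = toy.copy()
# Median imputation (assumed here; the median is less affected by extreme values than the mean)
imputed[numeric_cols] = imputed[numeric_cols].fillna(imputed[numeric_cols].median())

remaining = int(imputed.isna().sum().sum())
print("Missing values after treatment:", remaining)
```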

Q1.3. : Transform Target variable into 0 and 1 :

A new dependent variable named "Default" was created based on the
criteria given in the project notes.

Criteria:
1 - If the Net Worth Next Year is negative for the company
0 - If the Net Worth Next Year is positive for the company

Made use of the np.where function to achieve this.

Creating a binary target variable using 'Networth_Next_Year'.

After generating the dependent column, we checked for the split of
data based on this dependent variable. Below is a bar plot showing the
same.
Distinct values of the dependent variable – 0 and 1.
0 3271
1 315
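
A minimal sketch of the np.where step on made-up values is shown below. Only the column name 'Networth_Next_Year' and the 0/1 rule come from the report; zero is treated as non-default here, as in the target definition given earlier.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only
toy = pd.DataFrame({"Networth_Next_Year": [12.4, -3.1, 0.0, -0.5, 8.0]})

# default = 1 when next year's net worth is negative, otherwise 0
toy["default"] = np.where(toy["Networth_Next_Year"] < 0, 1, 0)

# Class split, analogous to the 0/1 counts reported above
counts = toy["default"].value_counts()
print(counts)
```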

Q1.4. : Univariate & Bivariate analysis with proper interpretation : (You

may choose to include only those variables which were significant in
the model building)

We could see that all the important features contributing to the model seem
to have a lot of outliers.

Most of the variables also take values in both the positive and negative range.

Univariate Analysis :

Boxplots have been created for the numerical variables which are
important features in the dataset.
Distribution of column with Displot & Box plot:
Bivariate Analysis

Gross Sales Vs Net Sales:

There exists a linear relationship between these two important variables.

Networth Vs Capital Employed:

As the capital employed increases, net worth also increases, but in some cases
capital seems to be disbursed even for lesser networth.
Networth Vs Cost of Production
Multi-variate Analysis:

- We also performed multivariate analysis on the data to see if there
are any correlations observed within the data.

- The correlation function was used, and a seaborn clustermap was used to
plot the correlations and to make better sense of the data.

- We observed that networth and networth next year were highly
correlated.

- We also found that various Rate of Growth variables were highly correlated.

- This analysis tells us that there is a problem of collinearity with this
data set.
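
One way to list the highly correlated pairs behind this collinearity finding is sketched below. The 0.9 threshold, the helper name `high_corr_pairs` and the synthetic data are assumptions for illustration; the report does not state a cutoff.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return variable pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = frame.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair appears once and the diagonal is dropped
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic data: the second column is almost a copy of the first, the third is independent
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "Networth": x,
    "Networth_Next_Year": x + 0.05 * rng.normal(size=200),
    "Other": rng.normal(size=200),
})

pairs = high_corr_pairs(toy)
print(pairs)
```

The same correlation matrix (`toy.corr()` in this sketch) is what would be passed to seaborn's `clustermap` to obtain the clustered plot.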

Heatmap has been plotted as follows :

: Train Test Split :

- We split the data set into df_1 (data which has the independent
variables) and df_2 (data which has the target variable).
- We split the data into training and testing sets in the ratio
of 67:33, then fit the model on the training set and evaluate its
performance on both sets.
- A seed value of 42 was used.
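
The split can be sketched with scikit-learn's `train_test_split`. The names df_1 and df_2, the 67:33 ratio and the seed of 42 come from the report; the random feature values and the three-column layout are placeholders, and whether the original split was stratified is not stated, so no stratification is applied here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the report's data set
rng = np.random.default_rng(0)
n_rows = 3586
df_1 = pd.DataFrame(rng.normal(size=(n_rows, 3)), columns=["f1", "f2", "f3"])  # independent variables
df_2 = pd.Series(rng.integers(0, 2, size=n_rows), name="default")              # target variable

# 33% of the rows go to the test set; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    df_1, df_2, test_size=0.33, random_state=42
)
print(X_train.shape, X_test.shape)
```

With 3586 rows this yields 2402 training rows and 1184 test rows, which agrees with the support totals in the classification reports.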
Q 1.6. : Build Logistic Regression Model (using statsmodels library) on
most important variables on Train Dataset and choose the optimum
cutoff. Also showcase your model building approach.

For model building, we use recursive feature elimination (RFE)
to select the top 15 features that would contribute to the
model well.

A weightage is given to each variable and, based on the weightage,
rankings are provided.

For modeling we will use Logistic Regression with recursive feature
elimination.
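
Selecting the top 15 features with recursive feature elimination can be sketched as follows. The synthetic data from `make_classification`, the scaling step and the default regularization of `LogisticRegression` are assumptions made so that the example runs on its own; they are not taken from the report.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (30 candidate features)
X, y = make_classification(n_samples=500, n_features=30, n_informative=8, random_state=42)
X_scaled = StandardScaler().fit_transform(X)  # scaling makes coefficient sizes comparable

# RFE repeatedly fits the model and drops the feature with the smallest absolute coefficient
estimator = LogisticRegression(max_iter=10000)
selector = RFE(estimator=estimator, n_features_to_select=15)
selector.fit(X_scaled, y)

selected = selector.support_   # boolean mask of the 15 retained features
ranking = selector.ranking_    # 1 = retained; larger numbers were eliminated earlier
print(int(selected.sum()), ranking)
```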

Applying GridSearchCV for Logistic Regression :

grid_search.best_params_ and grid_search.best_estimator_ are as
follows :
{'penalty': 'none', 'solver': 'lbfgs', 'tol': 0.0001}

LogisticRegression(max_iter=10000, n_jobs=2, penalty='none')
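
A grid search of this kind can be sketched as below. The exact grid used in the report is not shown, so the grid here (over `solver`, `tol` and `C`), the 3-fold cross-validation and the synthetic data are all assumptions. The `penalty` argument is left out on purpose: the string 'none' seen in the output above belongs to older scikit-learn versions, and more recent releases are understood to reject it.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the training data
X, y = make_classification(n_samples=400, n_features=10, weights=[0.9, 0.1], random_state=42)

# Hypothetical grid; the report's actual grid is not given
param_grid = {
    "solver": ["lbfgs", "liblinear"],
    "tol": [1e-4, 1e-5],
    "C": [0.1, 1.0, 10.0],
}

grid_search = GridSearchCV(
    LogisticRegression(max_iter=10000),
    param_grid=param_grid,
    cv=3,  # 3-fold cross-validation, default scoring (accuracy)
)
grid_search.fit(X, y)

print(grid_search.best_params_)
print(grid_search.best_estimator_)
```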
Q1.7. : Validate the Model on Test Dataset and state the performance
metrics. Also state interpretation from the model.

We train the model and then validate the model
on both the training and testing sets.

Predicted probabilities for classes 0 and 1 (first five rows):

   0     1
0  1.00  0.00
1  0.97  0.03
2  0.99  0.01
3  0.73  0.27
4  1.00  0.00

We are plotting the confusion matrix and classification report for both sets.

We could see high precision and accuracy, but the recall seems to be
less in the training data. We need to improve the recall value, since a
higher recall means more True Positives (TP), which in turn means that we
correctly identify more of the defaulters. If we miss a defaulter, that
would account to the bank paying higher interests on the existing debts,
and cash flow will not be regularized in the bank.

Confusion matrix and Classification Report for the training set:

[[2165   26]
 [  86  125]]

              precision    recall  f1-score   support

           0       0.96      0.99      0.97      2191
           1       0.83      0.59      0.69       211

    accuracy                           0.95      2402
   macro avg       0.89      0.79      0.83      2402
weighted avg       0.95      0.95      0.95      2402

Confusion matrix and Classification Report for the test set :

We could see high precision and accuracy, but the recall seems to be
less in the testing set.

[[1062   18]
 [  43   61]]

              precision    recall  f1-score   support

           0       0.96      0.98      0.97      1080
           1       0.77      0.59      0.67       104

    accuracy                           0.95      1184
   macro avg       0.87      0.78      0.82      1184
weighted avg       0.94      0.95      0.95      1184
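
The class-1 (defaulter) figures in both reports follow directly from the two confusion matrices. The sketch below recomputes them; the helper name `summarize` is illustrative, and the matrices are read as rows = actual class, columns = predicted class.

```python
import numpy as np

def summarize(cm):
    """Accuracy plus precision, recall and F1 for class 1 from a 2x2 confusion matrix."""
    tn, fp, fn, tp = np.asarray(cm).ravel()
    precision = tp / (tp + fp)   # share of predicted defaulters that really defaulted
    recall = tp / (tp + fn)      # share of actual defaulters that were caught
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

train = summarize([[2165, 26], [86, 125]])
test = summarize([[1062, 18], [43, 61]])

for name, metrics in (("train", train), ("test", test)):
    print(name, {key: round(float(value), 2) for key, value in metrics.items()})
```

In the test set, 43 of the 104 actual defaulters are predicted as non-defaulters, which is why recall stays near 0.59 even though accuracy is about 0.95.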

Finally, we are able to achieve a decent recall value without
overfitting. Considering the challenges such as outliers, missing
values and correlated features, this is a fairly good model. It can be
improved if we get better quality data where the features explaining
the default are not missing to this extent. Of course, we can also try other
techniques which are not sensitive to missing values and outliers.
