
BUSINESS REPORT

FRA Project - Company_Finance_Data - Part A

JULY 21, 2024


Submitted By C. GNANAVEL
Agenda for the Report
1. Introduction
   - Objectives of the analysis
2. About the Data
   - Shape of the data
   - Data info
   - Data description
3. Exploratory Data Analysis (EDA)
   - Multivariate analysis
4. Data Preprocessing
   - Outlier detection
   - Encoding
   - Data split
   - Scaling
   - Target variable creation
5. Model Building
   - Evaluation metrics
   - Logistic Regression
   - Random Forest
   - Model performance
6. Model Performance Improvement
   - Multicollinearity
   - Optimal threshold
   - Hyperparameter tuning
   - Model performance check
7. Model Performance Comparison and Final Model Selection
   - Model comparison
   - Feature importance
8. Actionable Insights & Recommendations
   - Insights
   - Recommendations
1. Introduction
Objective

The goal of this project is to predict whether a company's net worth will be positive or
negative in the following year, i.e., whether the company is likely to default. More broadly, the
objective is to perform a comprehensive finance and risk analysis using the company's historical
financial data. The analysis aims to identify key financial indicators, predict potential financial
risks, and provide actionable insights to improve financial stability and performance.

Dataset:
Company Finance Data: Company_Fin_Data.csv

2. About the Data


- Shape of the Data: (4256, 51)

The dataset contains 4256 rows and 51 columns.

- df.info(): summary of column data types and non-null counts.

- Creating a binary target variable "default" from 'Networth_Next_Year':
  0 means No Default
  1 means Default
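The target could be constructed along the following lines; this is a minimal sketch that assumes the file name Company_Fin_Data.csv listed in the Dataset section and that a default is defined as a negative net worth in the following year.

```python
import pandas as pd

# Load the company finance dataset (file name as listed in the Dataset section).
df = pd.read_csv("Company_Fin_Data.csv")

# Assumption: a company is flagged as a default (1) when next year's net worth
# is negative, otherwise it is a non-default (0).
df["default"] = (df["Networth_Next_Year"] < 0).astype(int)

print(df[["default", "Networth_Next_Year"]].head(10))
print(df["default"].value_counts())
```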

- Top 10 head rows:

     default  Networth_Next_Year
  0        0              395.30
  1        0               36.20
  2        0               84.00
  3        0             2041.40
  4        0               41.80
  5        0              291.50
  6        0               93.30
  7        0              985.10
  8        0              188.60
  9        0              229.60

- No Default and Default counts:

  0    3352
  1     904
  Name: default, dtype: int64

3352 companies are non-defaulters and 904 companies are defaulters. Based on these counts,
about 79% of the companies are non-defaulters and the remaining 21% are defaulters.
- The number of duplicate rows: 0

- Number of null values, column-wise:

- Number of missing values in the dataset: 1778

- Percentage of missing values in the dataset: 8.03%

- Counting the total outliers by column

- After removing the three columns Num, Networth_Next_Year and Equity_face_value, the shape
  of the data is reduced to (4256, 48).
Plotting the Missing Values

Inference of Missing Values:


Variables with No Missing Values:
- Several variables have no missing values at all, such as Total_assets,
Net_worth, Total_income, Total_expenses, and others. These variables have a count of zero in
the "True" category (indicating missing values) and full bars in the "False" category (indicating
no missing values).

Variables with Some Missing Values:


- A significant number of variables have some missing values. Notable
examples include Change_in_stock, Cash_profit_as_perc_of_total_income, Other_income, and
others. These variables have bars in both the "False" and "True" categories, indicating the
presence of some missing values.

Variables with High Missing Values:


- Certain variables have a high count of missing values, such as
Income_from_financial_services, Deferred_tax_liability, Contingent_liabilities,
Cash_to_average_cost_of_sales_per_day, and a few others. These variables have substantial bars
in the "True" category.
Completely Missing Variables:
- No variables are completely missing; none of the variables have a missing
proportion of 1 (which would indicate that all of their values are missing).

- After dropping the columns with more than 30% missing values, the shape of the data is
  reduced to (4256, 44). A sketch of this step is shown below.
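A minimal sketch of the missing-value check and the 30% column drop described above, building on the df loaded earlier (Num, Networth_Next_Year and Equity_face_value are assumed to have been removed already); the report's exact treatment may differ.

```python
# Per-column percentage of missing values.
missing_pct = df.isnull().mean() * 100
print("Total missing values:", int(df.isnull().sum().sum()))
print(missing_pct.sort_values(ascending=False).head(10).round(2))

# Drop columns where more than 30% of the values are missing.
df = df.drop(columns=missing_pct[missing_pct > 30].index)
print("Shape after dropping high-missing columns:", df.shape)
```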

Correlations between independent variables

Inference from the Correlation Heatmap


The correlation heatmap provided shows the pairwise correlation
coefficients between various financial metrics of a dataset. Here's a detailed analysis of the
insights that can be derived from this heatmap:
High Correlation Clusters:
There are noticeable clusters where certain groups of variables exhibit
high positive correlation (indicated by yellow or bright orange). These clusters suggest that
these variables tend to increase or decrease together.
Examples include: total assets, net worth, and total income; profit after
tax (PAT), cash profit, and PBDITA; current assets, net working capital, and the current ratio.

Negative Correlations:
There are some pairs of variables with negative correlations (indicated by
dark purple), suggesting an inverse relationship. For example:
Debt to equity ratio and net worth may have a negative correlation,
indicating that companies with higher debt relative to equity tend to have lower net worth.

Isolated Variables:
Some variables have low correlation with most other variables (indicated
by darker colors overall). These variables might not be directly related to the rest and can
provide unique information.

Highly Correlated Pairs:


The diagonal of the heatmap, as expected, shows perfect correlation
(correlation coefficient of 1) since it represents each variable correlated with itself.
Off-diagonal elements with bright yellow color indicate pairs of variables
with high positive correlation. These pairs could potentially be redundant if used together in a
model, as they provide similar information.

Potential Multicollinearity:
Variables with high correlation coefficients might lead to
multicollinearity in regression models. For instance, total assets, net worth, and total income are
highly correlated, which might pose multicollinearity issues if included together in a linear
regression model.
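For reference, a heatmap like the one discussed above could be produced with seaborn; the palette and figure size here are illustrative choices, not necessarily those used in the report.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations of the numeric predictors in the cleaned DataFrame df.
corr = df.select_dtypes("number").corr()

plt.figure(figsize=(16, 12))
sns.heatmap(corr, cmap="magma", center=0, linewidths=0.2)
plt.title("Correlation heatmap of financial metrics")
plt.tight_layout()
plt.show()
```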

Initial VIF Calculation:


feature VIF
0 Total_assets inf
1 Net_worth 62.92
2 Total_income 162.37
3 Change_in_stock 1.51
4 Total_expenses 99.08
5 Profit_after_tax 34.81
6 PBDITA 21.16
7 PBT 36.89
8 Cash_profit 23.86
9 PBDITA_as_perc_of_total_income 4.04
10 PBT_as_perc_of_total_income 12.37
11 PAT_as_perc_of_total_income 10.43
12 Cash_profit_as_perc_of_total_income 5.65
13 PAT_as_perc_of_net_worth 2.16
14 Sales 99.87
15 Income_from_fincial_services 2.02
16 Total_capital 3.21
17 Reserves_and_funds 13.74
18 Borrowings 5.26
19 Current_liabilities_&_provisions 7.13
20 Shareholders_funds 62.60
21 Cumulative_retained_profits 8.25
22 Capital_employed 21.30
23 TOL_toTNW 3.32
24 Total_term_liabilities_to_tangible_net_worth 4.33
25 Contingent_liabilities_to_Net_worth_ perc 1.20
26 Net_fixed_assets 5.86
27 Current_assets 10.03
28 Net_working_capital 2.25
29 Quick_ratio_ times 3.22
30 Current_ratio_ times 2.76
31 Debt_to_equity_ratio_ times 6.15
32 Cash_to_current_liabilities_ times 2.14
33 Cash_to_average_cost_of_sales_per_day 1.85
34 Creditors_turnover 1.51
35 Debtors_turnover 1.52
36 Finished_goods_turnover 1.55
37 WIP_turnover 1.72
38 Raw_material_turnover 1.38
39 Shares_outstanding 3.14
40 EPS 7.75
41 Adjusted_EPS 7.02
42 Total_liabilities inf

Eliminating columns with VIF > 5 (recomputed VIF values below):

feature VIF

1 Profit_after_tax 4.73

9 Cumulative_retained_profits 4.46

13 Net_fixed_assets 4.37

8 Current_liabilities_&_provisions 3.59

7 Borrowings 3.37

15 Quick_ratio_ times 3.20

24 Shares_outstanding 3.05

6 Total_capital 3.04

16 Current_ratio_ times 2.73

3 PAT_as_perc_of_total_income 2.55

10 TOL_toTNW 2.37

11 Total_term_liabilities_to_tangible_net_worth 2.24

17 Cash_to_current_liabilities_ times 2.10

4 PAT_as_perc_of_net_worth 2.05

14 Net_working_capital 1.99

2 PBDITA_as_perc_of_total_income 1.94

5 Income_from_fincial_services 1.83

18 Cash_to_average_cost_of_sales_per_day 1.82

22 WIP_turnover 1.71

21 Finished_goods_turnover 1.54

20 Debtors_turnover 1.48

0 Change_in_stock 1.46

19 Creditors_turnover 1.45

23 Raw_material_turnover 1.37

25 Adjusted_EPS 1.30

12 Contingent_liabilities_to_Net_worth_ perc 1.19
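One common way to implement this elimination is to compute VIFs with statsmodels and iteratively drop the feature with the highest VIF until every value is at or below 5. The sketch below assumes df from the earlier steps and fills remaining gaps with column medians purely so the VIF can be computed, which may differ from the report's actual preprocessing.

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def compute_vif(X: pd.DataFrame) -> pd.DataFrame:
    """Return a feature/VIF table for the numeric design matrix X."""
    X = X.assign(const=1.0)  # add an intercept column for the VIF calculation
    vif = pd.DataFrame({
        "feature": X.columns,
        "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    })
    return vif[vif["feature"] != "const"]

# Numeric predictors without the target (and without Networth_Next_Year, if still present).
X = df.drop(columns=["default", "Networth_Next_Year"], errors="ignore").select_dtypes("number")
X = X.fillna(X.median())  # assumption: median fill so VIF can be computed

# Iteratively drop the feature with the highest VIF until every VIF <= 5.
while True:
    vif = compute_vif(X).sort_values("VIF", ascending=False)
    if vif["VIF"].iloc[0] <= 5 or X.shape[1] == 1:
        break
    X = X.drop(columns=[vif["feature"].iloc[0]])

print(vif.round(2))
```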


Model performances with Logistic Regression and Random Forest

   Metric     Logistic Regression   Random Forest
0  Accuracy   0.78                  0.70
1  Precision  0.32                  0.14
2  Recall     0.02                  0.07
3  F1 Score   0.04                  0.09
4  AUC-ROC    0.56                  0.34
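A minimal sketch of how these baseline models and metrics might be produced, assuming X is the VIF-reduced predictor frame from the previous step and "default" is the target; the split ratio, imputation, and scaling choices here are assumptions rather than the report's exact settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Align the target with the reduced predictor frame from the VIF step.
y = df.loc[X.index, "default"]

# Stratified split, then impute any remaining gaps and scale the features.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

imputer = SimpleImputer(strategy="median")   # assumption: median imputation
scaler = StandardScaler()
X_train_s = scaler.fit_transform(imputer.fit_transform(X_train))
X_test_s = scaler.transform(imputer.transform(X_test))

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    pred = model.predict(X_test_s)
    proba = model.predict_proba(X_test_s)[:, 1]
    print(f"{name}: Accuracy={accuracy_score(y_test, pred):.2f}, "
          f"Precision={precision_score(y_test, pred, zero_division=0):.2f}, "
          f"Recall={recall_score(y_test, pred):.2f}, "
          f"F1={f1_score(y_test, pred):.2f}, "
          f"AUC-ROC={roc_auc_score(y_test, proba):.2f}")
```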

Insights from Logistic Regression and Random Forest

Accuracy:
Logistic Regression: 0.78
Random Forest: 0.70
Logistic Regression has a higher accuracy compared to Random Forest.
However, accuracy alone is not sufficient, especially if the dataset is imbalanced.

Precision:
Logistic Regression: 0.32
Random Forest: 0.14
Logistic Regression outperforms Random Forest in precision, meaning it
has fewer false positives relative to true positives.

Recall:
Logistic Regression: 0.02
Random Forest: 0.07
Random Forest has a higher recall compared to Logistic Regression,
indicating it identifies more true positives relative to false negatives.

F1 Score:
Logistic Regression: 0.04
Random Forest: 0.09
The F1 score for both models is quite low, indicating poor performance in
balancing precision and recall. Random Forest slightly outperforms Logistic Regression in this
regard.

AUC-ROC:
Logistic Regression: 0.56
Random Forest: 0.34
Logistic Regression has a higher AUC-ROC, suggesting it has better
discriminative ability than Random Forest.

Identify Optimal Threshold for Logistic Regression Using ROC Curve

Output:
Optimal Threshold: 0.192776689616034

Logistic Regression (Optimal Threshold) -
Accuracy: 0.5536413469068129, Precision: 0.2562396006655574, Recall: 0.555956678700361,
F1 Score: 0.35079726651480636, AUC-ROC: 0.5557238267148015
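One common way to obtain such a threshold is to maximise Youden's J statistic (TPR minus FPR) along the ROC curve. The sketch below reuses the fitted logistic regression, scaled test data, and labels from the modelling sketch above; the selection criterion is an assumption rather than the report's stated method.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score, roc_curve)

# Pick the threshold that maximises Youden's J (TPR - FPR) on the ROC curve.
log_reg = models["Logistic Regression"]
proba = log_reg.predict_proba(X_test_s)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, proba)
optimal_threshold = thresholds[np.argmax(tpr - fpr)]
print("Optimal Threshold:", optimal_threshold)

# Re-score the model at the tuned threshold instead of the default 0.5.
pred_opt = (proba >= optimal_threshold).astype(int)
print("Accuracy:", accuracy_score(y_test, pred_opt),
      "Precision:", precision_score(y_test, pred_opt, zero_division=0),
      "Recall:", recall_score(y_test, pred_opt),
      "F1 Score:", f1_score(y_test, pred_opt),
      "AUC-ROC:", roc_auc_score(y_test, proba))
```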

Inference based on Optimal Threshold for Logistic Regression Using ROC Curve

Based on the performance metrics for the Logistic Regression model using
the optimal threshold of 0.192776689616034, here are the insights:

Accuracy:
The model correctly predicts approximately 55.36% of the instances,
which is relatively low. This suggests that the model's ability to classify both classes correctly is
limited, but this is likely a reflection of optimizing for recall rather than overall accuracy.

Precision:
About 25.62% of the instances predicted as positive are true positives. This
relatively low precision indicates a high number of false positives.

Recall:
The model correctly identifies about 55.60% of the true positive
instances. This relatively high recall shows that the model is effective in identifying a substantial
proportion of positive cases, which is important in scenarios where missing a positive case is
costly.
F1 Score:
The F1 score of 0.3508 indicates a moderate balance between precision and
recall. Given the low precision, this score suggests that the model still performs reasonably well
in balancing false positives and false negatives.

AUC-ROC:
The AUC-ROC of 0.5557 is slightly above 0.5, indicating the model's
ability to distinguish between positive and negative classes is only marginally better than random
guessing. This is a critical metric as it reflects the overall discriminatory power of the model.

Insights Summary
- The model's accuracy and AUC-ROC are relatively low, indicating challenges in overall
classification performance.
- The high recall demonstrates the model's strength in identifying positive cases, which is
valuable in scenarios where detecting positives is crucial.
- The precision is relatively low, indicating that a significant proportion of positive predictions
are incorrect, leading to many false positives.
- The F1 score, while moderate, suggests the model maintains a reasonable balance between
precision and recall despite the skewed metrics.

Hyperparameter Tuning for Random Forest

Output:
Best Parameters for Random Forest: {'max_depth': 20, 'min_samples_leaf': 2,
'min_samples_split': 10, 'n_estimators': 100}

Random Forest (Best Model) - Accuracy: 0.7580266249021144, Precision: 0.2714285714285714,
Recall: 0.06859205776173286, F1 Score: 0.10951008645533142, AUC-ROC: 0.4014873646209386
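Parameters of this kind are typically found with a grid search; a sketch along the following lines could produce them, reusing X_train_s and y_train from the earlier modelling sketch. The parameter grid and the scoring metric are illustrative assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Candidate hyperparameters (an illustrative grid that includes the reported best values).
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 2],
}
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",   # assumption: optimise F1 on the minority (default) class
    cv=5,
    n_jobs=-1,
)
grid.fit(X_train_s, y_train)
print("Best Parameters for Random Forest:", grid.best_params_)
best_rf = grid.best_estimator_
```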

Inference Hyperparameter Tuning for Random Forest

Accuracy:
The accuracy of the tuned Random Forest model is 0.7580, which is
reasonably high but slightly lower than the Logistic Regression model mentioned earlier.

Precision:
The precision is 0.2714, which is an improvement over the untuned
Random Forest model (0.14) but still indicates a considerable number of false positives.
Recall:
The recall is 0.0686, which remains very low and is essentially unchanged from the
untuned model, indicating that the model is still missing most true positive cases.

F1 Score:
The F1 score has improved to 0.1095, showing better balance between
precision and recall compared to the untuned model, but it remains relatively low.

AUC-ROC:
The AUC-ROC has improved to 0.4015 but is still low, indicating that the
model has limited ability to discriminate between positive and negative classes.

Model Performance Check Across Different Metrics

   Metric     Logistic Regression (Optimal Threshold)
0  Accuracy   0.55
1  Precision  0.26
2  Recall     0.56
3  F1 Score   0.35
4  AUC-ROC    0.56

   Metric     Random Forest (Best Model)
0  Accuracy   0.76
1  Precision  0.27
2  Recall     0.07
3  F1 Score   0.11
4  AUC-ROC    0.40

Insights and Recommendations

Insights
Accuracy:
Logistic Regression (Optimal Threshold): 0.55
Random Forest (Best Model): 0.76
Random Forest has a significantly higher accuracy compared to Logistic Regression.

Precision:
Logistic Regression (Optimal Threshold): 0.26
Random Forest (Best Model): 0.27
Both models have similar precision, with Random Forest being slightly better.
Recall:
Logistic Regression (Optimal Threshold): 0.56
Random Forest (Best Model): 0.07
Logistic Regression has a much higher recall, indicating it is better at identifying true
positives.

F1 Score:
Logistic Regression (Optimal Threshold): 0.35
Random Forest (Best Model): 0.11
Logistic Regression has a higher F1 score, suggesting a better balance between
precision and recall.

AUC-ROC:
Logistic Regression (Optimal Threshold): 0.56
Random Forest (Best Model): 0.40
Logistic Regression has a higher AUC-ROC, indicating better overall performance in
discriminating between classes.

Recommendations:

Model Selection:
Logistic Regression (with optimal threshold) seems to be the better choice in scenarios
where recall and balanced performance are crucial. Its higher recall and F1 score suggest it
performs better at identifying true positives and maintaining a balance between precision and
recall.
Random Forest might be preferred when accuracy is more important. Its higher accuracy
indicates it correctly predicts a higher number of overall cases, but its low recall and F1 score
suggest it struggles with imbalanced classes.

Threshold Adjustment:
Continue to optimize the decision threshold for both models. This can help in finding a
more suitable balance between precision and recall, especially for the Random Forest model.

Model Improvement:
For Logistic Regression:
Further tuning of the regularization parameter and feature engineering might help in
improving its performance.
For Random Forest:
Additional hyperparameter tuning and using class weights could help in improving recall
and F1 score.
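As a sketch of the class-weight suggestion, the tuned forest could be refit with balanced class weights; the parameter values below simply reuse the tuning output above and the combination is illustrative, not the report's final configuration.

```python
from sklearn.ensemble import RandomForestClassifier

# "balanced" scales class weights inversely to class frequencies, giving the
# minority (default) class more influence during training.
rf_weighted = RandomForestClassifier(
    n_estimators=100, max_depth=20, min_samples_leaf=2, min_samples_split=10,
    class_weight="balanced", random_state=42)
rf_weighted.fit(X_train_s, y_train)
```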

Advanced Techniques:
Implement ensemble methods such as stacking or blending to combine the strengths of
both Logistic Regression and Random Forest, potentially leading to improved overall
performance.
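A sketch of the stacking idea, assuming the scaled training data from the earlier modelling step; the choice of base estimators and meta-learner here is illustrative.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

# Stack the two models and let a logistic meta-learner combine their predictions.
stacked = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stacked.fit(X_train_s, y_train)
```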
Handling Class Imbalance:
Employ advanced resampling techniques like SMOTE or ADASYN to address class
imbalance, which can improve the recall and F1 score of the Random Forest model.
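A sketch of SMOTE applied to the training split only (it requires the imbalanced-learn package and complete, imputed features); the resampled data would then be used to refit the Random Forest.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE  # provided by the imbalanced-learn package

# Oversample the minority (default) class in the training data only, never the test data.
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train_s, y_train)
print("Class counts after SMOTE:", Counter(y_train_res))
```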

Feature Engineering:
Enhance the dataset by creating new features, removing irrelevant ones, or transforming
existing features to improve model performance for both Logistic Regression and Random Forest.

Conclusion
Based on the provided data:
Logistic Regression (Optimal Threshold) is recommended when the goal is
to maximize recall and achieve a balanced performance (higher F1 score and AUC-ROC).
Random Forest (Best Model) is recommended when higher accuracy is the
primary objective, though it requires improvements in recall and overall balanced performance.
