Business_Report-Comp-Fin_Data_Part A_Problem
The goal of this project is to predict whether a company's net worth will be positive or
negative in the following year. More broadly, the objective is to perform a comprehensive
finance and risk analysis for the company using historical financial data. This analysis aims to
identify key financial indicators, predict potential financial risks, and provide actionable insights
to improve the company's financial stability and performance.
Dataset:
Company Finance Data: Company_Fin_Data.csv
df.info():
Creating a binary target variable from 'Networth_Next_Year':
0 means No Default (Networth_Next_Year is positive)
1 means Default (Networth_Next_Year is negative)
First ten rows of the new target alongside Networth_Next_Year:
   default  Networth_Next_Year
0        0              395.30
1        0               36.20
2        0               84.00
3        0             2041.40
4        0               41.80
5        0              291.50
6        0               93.30
7        0              985.10
8        0              188.60
9        0              229.60
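The target construction described above can be sketched as follows (the column name is taken from the report; the sample values, including the negative net worths, are invented to show both classes):

```python
import pandas as pd

# Column name from the report; values invented so both classes appear.
df = pd.DataFrame({"Networth_Next_Year": [395.30, 36.20, -12.50, 2041.40, -3.10]})

# 1 = Default (net worth turns negative next year), 0 = No Default
df["default"] = (df["Networth_Next_Year"] < 0).astype(int)
print(df["default"].tolist())  # → [0, 0, 1, 0, 1]
```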
After dropping the columns with more than 30% missing values, the shape of the data is
reduced to (4256, 44).
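A minimal sketch of that missing-value filter, assuming a pandas DataFrame (the tiny frame and column names below are illustrative, not from Company_Fin_Data.csv):

```python
import numpy as np
import pandas as pd

# Illustrative frame: 'mostly_missing' is 75% NaN and gets dropped.
df = pd.DataFrame({
    "keep": [1.0, 2.0, 3.0, 4.0],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
})
df = df.loc[:, df.isna().mean() <= 0.30]   # keep columns with <= 30% missing
print(df.columns.tolist())  # → ['keep']
```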
Negative Correlations:
There are some pairs of variables with negative correlations (indicated by
dark purple), suggesting an inverse relationship. For example:
Debt to equity ratio and net worth may have a negative correlation,
indicating that companies with higher debt relative to equity tend to have lower net worth.
Isolated Variables:
Some variables have low correlation with most other variables (indicated
by darker colors overall). These variables might not be directly related to the rest and can
provide unique information.
Potential Multicollinearity:
Variables with high correlation coefficients might lead to
multicollinearity in regression models. For instance, total assets, net worth, and total income are
highly correlated, which might pose multicollinearity issues if included together in a linear
regression model.
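The correlation screening described above can be reproduced along these lines (synthetic data; the 0.8 cutoff is an illustrative choice, not the report's):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
assets = rng.normal(100, 10, 200)
df = pd.DataFrame({
    "total_assets": assets,
    "net_worth": 0.6 * assets + rng.normal(0, 1, 200),  # strongly tied to assets
    "wip_turnover": rng.normal(5, 1, 200),              # roughly independent
})

corr = df.corr()
# List feature pairs whose absolute correlation exceeds the cutoff.
high = [(a, b) for a in corr.columns for b in corr.columns
        if a < b and abs(corr.loc[a, b]) > 0.8]
print(high)  # → [('net_worth', 'total_assets')]
```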
VIF values for the retained features (sorted descending):
    feature                                       VIF
 1  Profit_after_tax                              4.73
 9  Cumulative_retained_profits                   4.46
13  Net_fixed_assets                              4.37
 8  Current_liabilities_&_provisions              3.59
 7  Borrowings                                    3.37
24  Shares_outstanding                            3.05
 6  Total_capital                                 3.04
 3  PAT_as_perc_of_total_income                   2.55
10  TOL_toTNW                                     2.37
11  Total_term_liabilities_to_tangible_net_worth  2.24
 4  PAT_as_perc_of_net_worth                      2.05
14  Net_working_capital                           1.99
 2  PBDITA_as_perc_of_total_income                1.94
 5  Income_from_fincial_services                  1.83
18  Cash_to_average_cost_of_sales_per_day         1.82
22  WIP_turnover                                  1.71
21  Finished_goods_turnover                       1.54
20  Debtors_turnover                              1.48
 0  Change_in_stock                               1.46
19  Creditors_turnover                            1.45
23  Raw_material_turnover                         1.37
25  Adjusted_EPS                                  1.30
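A sketch of how VIF figures like those above can be computed: regress each feature on the remaining ones and take 1/(1 - R²). The data here is synthetic; the report presumably used a library implementation such as statsmodels' variance_inflation_factor.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
X[:, 2] = 0.9 * X[:, 0] + 0.1 * rng.normal(size=500)   # column 2 nearly duplicates column 0

def vif(X, j):
    """VIF of column j: regress it on the other columns, return 1 / (1 - R^2)."""
    others = np.delete(X, j, axis=1)
    coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    resid = X[:, j] - others @ coef
    r2 = 1.0 - resid.var() / X[:, j].var()
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
print([round(v, 1) for v in vifs])  # columns 0 and 2 show inflated VIFs
```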
Accuracy:
Logistic Regression: 0.78
Random Forest: 0.70
Logistic Regression has a higher accuracy than Random Forest.
However, accuracy alone is not sufficient here: with defaults being a small minority class, a
model can achieve high accuracy simply by predicting "No Default" for nearly every company.
Precision:
Logistic Regression: 0.32
Random Forest: 0.14
Logistic Regression outperforms Random Forest in precision, meaning it
has fewer false positives relative to true positives.
Recall:
Logistic Regression: 0.02
Random Forest: 0.07
Random Forest has a higher recall compared to Logistic Regression,
indicating it identifies more true positives relative to false negatives.
F1 Score:
Logistic Regression: 0.04
Random Forest: 0.09
The F1 score for both models is quite low, indicating poor performance in
balancing precision and recall. Random Forest slightly outperforms Logistic Regression in this
regard.
AUC-ROC:
Logistic Regression: 0.56
Random Forest: 0.34
Logistic Regression has the higher AUC-ROC. Notably, Random Forest's 0.34 falls
below 0.5, meaning its probability ranking is worse than random guessing on this test set.
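For reference, the five metrics used throughout this report are typically computed along these lines with scikit-learn (toy labels and scores below, not the report's actual model outputs):

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Toy ground truth, hard predictions, and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 0, 0]
y_prob = [0.1, 0.2, 0.6, 0.1, 0.8, 0.4, 0.9, 0.2, 0.3, 0.45]

acc = accuracy_score(y_true, y_pred)    # fraction of correct predictions
prec = precision_score(y_true, y_pred)  # TP / (TP + FP)
rec = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_prob)     # ranking quality of the probabilities
print(acc, prec, rec, f1, auc)
```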
Output (Logistic Regression at the optimal threshold):
Optimal Threshold: 0.1928
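The report does not state how the optimal threshold was chosen; one common approach is to maximize Youden's J statistic (TPR − FPR) over the ROC curve, sketched here on synthetic labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(42)
y = rng.integers(0, 2, 300)                                   # synthetic labels
scores = np.clip(y * 0.3 + rng.normal(0.3, 0.2, 300), 0, 1)   # synthetic probabilities

fpr, tpr, thresholds = roc_curve(y, scores)
best = float(thresholds[np.argmax(tpr - fpr)])   # threshold maximizing Youden's J
print(round(best, 4))
```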
Accuracy:
The model correctly predicts approximately 55.36% of the instances,
which is relatively low. This suggests that the model's ability to classify both classes correctly is
limited, but this is likely a reflection of optimizing for recall rather than overall accuracy.
Precision:
About 25.62% of the instances predicted as positive are true positives. This
relatively low precision indicates a high number of false positives.
Recall:
The model correctly identifies about 55.60% of the true positive
instances. This relatively high recall shows that the model is effective in identifying a substantial
proportion of positive cases, which is important in scenarios where missing a positive case is
costly.
F1 Score:
The F1 score of 0.3508 reflects the trade-off between the low precision and the
relatively high recall: a moderate balance overall, held down mainly by the large number of
false positives.
AUC-ROC:
The AUC-ROC of 0.5557 is slightly above 0.5, indicating the model's
ability to distinguish between positive and negative classes is only marginally better than random
guessing. This is a critical metric as it reflects the overall discriminatory power of the model.
Insights Summary
- The model's accuracy and AUC-ROC are relatively low, indicating challenges in overall
classification performance.
- The high recall demonstrates the model's strength in identifying positive cases, which is
valuable in scenarios where detecting positives is crucial.
- The precision is relatively low, indicating that a significant proportion of positive predictions
are incorrect, leading to many false positives.
- The F1 score, while moderate, suggests the model maintains a reasonable balance between
precision and recall despite the skewed metrics.
Tuned Random Forest
Accuracy:
The accuracy of the tuned Random Forest model is 0.7580, which is
reasonably high but slightly lower than the Logistic Regression model discussed earlier (0.78).
Precision:
The precision is 0.2714, which is an improvement over the untuned
Random Forest model (0.14) but still indicates a considerable number of false positives.
Recall:
The recall is 0.0686, essentially unchanged from the untuned model (0.07)
and still very low, indicating that the model misses the large majority of true positive cases.
F1 Score:
The F1 score has improved to 0.1095, showing better balance between
precision and recall compared to the untuned model, but it remains relatively low.
AUC-ROC:
The AUC-ROC has improved to 0.4015 but remains below 0.5, meaning the
model's ranking of positives over negatives is still worse than random guessing.
Insights: Comparing the Final Models
Accuracy:
Logistic Regression (Optimal Threshold): 0.55
Random Forest (Best Model): 0.76
Random Forest has a significantly higher accuracy compared to Logistic Regression.
Precision:
Logistic Regression (Optimal Threshold): 0.26
Random Forest (Best Model): 0.27
Both models have similar precision, with Random Forest being slightly better.
Recall:
Logistic Regression (Optimal Threshold): 0.56
Random Forest (Best Model): 0.07
Logistic Regression has a much higher recall, indicating it is better at identifying true
positives.
F1 Score:
Logistic Regression (Optimal Threshold): 0.35
Random Forest (Best Model): 0.11
Logistic Regression has a higher F1 score, suggesting a better balance between
precision and recall.
AUC-ROC:
Logistic Regression (Optimal Threshold): 0.56
Random Forest (Best Model): 0.40
Logistic Regression has a higher AUC-ROC, indicating better overall performance in
discriminating between classes.
Recommendations:
Model Selection:
Logistic Regression (with optimal threshold) seems to be the better choice in scenarios
where recall and balanced performance are crucial. Its higher recall and F1 score suggest it
performs better at identifying true positives and maintaining a balance between precision and
recall.
Random Forest might be preferred when accuracy is more important. Its higher accuracy
indicates it correctly predicts a higher number of overall cases, but its low recall and F1 score
suggest it struggles with imbalanced classes.
Threshold Adjustment:
Continue to optimize the decision threshold for both models. This can help in finding a
more suitable balance between precision and recall, especially for the Random Forest model.
Model Improvement:
For Logistic Regression:
Further tuning of the regularization parameter and feature engineering might help in
improving its performance.
For Random Forest:
Additional hyperparameter tuning and using class weights could help in improving recall
and F1 score.
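The class-weight suggestion can be sketched as follows on a synthetic imbalanced problem (default hyperparameters; results on the real data will differ):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic 90/10 imbalanced classification problem.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

plain = RandomForestClassifier(random_state=0).fit(Xtr, ytr)
weighted = RandomForestClassifier(class_weight="balanced", random_state=0).fit(Xtr, ytr)

# Compare minority-class recall with and without class weights.
plain_rec = recall_score(yte, plain.predict(Xte))
weighted_rec = recall_score(yte, weighted.predict(Xte))
print(plain_rec, weighted_rec)
```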
Advanced Techniques:
Implement ensemble methods such as stacking or blending to combine the strengths of
both Logistic Regression and Random Forest, potentially leading to improved overall
performance.
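A minimal sketch of the stacking idea, combining the two model families above under a logistic-regression meta-learner (synthetic data, default hyperparameters):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem standing in for the company data.
X, y = make_classification(n_samples=1000, weights=[0.85, 0.15], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

stack = StackingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
)
acc = stack.fit(Xtr, ytr).score(Xte, yte)
print(round(acc, 3))
```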
Handling Class Imbalance:
Employ advanced resampling techniques like SMOTE or ADASYN to address class
imbalance, which can improve the recall and F1 score of the Random Forest model.
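The core SMOTE idea, synthesizing minority samples by interpolating between a minority point and one of its minority-class nearest neighbours, can be sketched in a few lines (a toy implementation; in practice one would use imblearn's SMOTE):

```python
import numpy as np

rng = np.random.default_rng(0)
minority = rng.normal(loc=5.0, size=(20, 3))   # toy minority-class points

def smote_like(X, n_new, rng):
    """Generate n_new points by interpolating each random minority point
    toward its nearest minority neighbour (the essence of SMOTE)."""
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        d = np.linalg.norm(X - X[i], axis=1)
        j = int(np.argsort(d)[1])              # nearest neighbour other than itself
        out.append(X[i] + rng.random() * (X[j] - X[i]))
    return np.array(out)

synthetic = smote_like(minority, 10, rng)
print(synthetic.shape)  # → (10, 3)
```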
Feature Engineering:
Enhance the dataset by creating new features, removing irrelevant ones, or transforming
existing features to improve model performance for both Logistic Regression and Random Forest.
Conclusion
Based on the provided data:
Logistic Regression (Optimal Threshold) is recommended when the goal is
to maximize recall and achieve a balanced performance (higher F1 score and AUC-ROC).
Random Forest (Best Model) is recommended when higher accuracy is the
primary objective, though it requires improvements in recall and overall balanced performance.