
Financial Risk Analysis: Great Learning PGPBABI 2017

This document describes a financial risk analysis project that aims to build a predictive model using logistic regression to classify companies as healthy or financially distressed based on historical financial data. The model addresses limitations in past research by focusing on precision for the minority class. Feature selection and sampling techniques were required due to missing data, many variables, and class imbalance. The results and lessons learned could help identify companies at risk of default.


Great Learning PGPBABI 2017

Financial Risk Analysis
Group Assignment

Contents

Team
Abbreviations
Introduction
  Scope and Objectives
  Data Source
  Statistical Tools and Platforms
  Limitations and Challenges
Understanding the Data
  Data Description
  Pre-processing
  Business Insights from Exploratory Analysis
Model and Methodology
  Feature Selection
  Sampling
  Machine Learning
  Recall, Precision and F Measure
  ROC Curve
  Model Performance and Validation
Conclusion and Lessons Learnt
  Lessons Learnt
Bibliography
Appendix 1: Dataset Exploratory Analysis
Appendix 2: Model Architecture Diagram
Appendix 3: Performance of Models


List of Figures

Figure 1: Correlation Matrix
Figure 2: Performance & Validation
Figure 3: Model Architecture Diagram - KNIME
Figure 4: Logistic Regression after SMOTE
Figure 5: Logistic Regression after Random Under Sampling

List of Tables

Table 1: Structure of Raw Dataset
Table 2: Response Variable Summary in Training Dataset
Table 3: Missing Values
Table 4: Selected Variables after Feature Selection
Table 5: Data Imbalance on Training Dataset
Table 6: Data Imbalance Treatment Summary
Table 7: Summary of Metrics for Logistic Regression
Table 8: Dataset Statistics
Table 9: Outlier Treatment Summary


Team

Abbreviations

ML Machine Learning

SMOTE Synthetic Minority Oversampling Technique

RUS Random Under Sampling

TP True Positive

TN True Negative

FP False Positive

FN False Negative

ROC Receiver Operating Characteristic

AUC Area Under Curve

AUROC Area Under Receiver Operating Characteristic Curve


Introduction

This assignment presents a financial distress prediction model that uses logistic regression to classify firms as healthy or financially distressed. Past research has largely ignored precision on the minority class and focused on the overall accuracy percentage. We present a model linking microeconomic variables directly to an aggregate measure of credit risk in organisations; the suggested method serves as an early warning system for firms in financial distress.

Keywords: Machine Learning, Feature Selection, Predictive Modelling, Logistic Regression, Recall, F-measure,
AUROC, SMOTE, KNIME, Financial Risk Analysis

Scope and Objectives

The scope of this assignment is limited to analysing the available data and predicting the classification outcome of the categorical dependent variable (default) using logistic regression. The objectives are to explore the data provided, build a predictive model with logistic regression that screens distressed companies from healthy ones, better predict the likelihood of default by an organisation, and identify the key drivers that determine this likelihood.

Data Source

The dataset contains historical data of Indian companies listed on the Bombay Stock Exchange. To protect personal information, key data which could identify an organisation has been anonymised.

Statistical Tools and Platforms

Multiple platforms were used at various stages of the assignment life cycle. KNIME was the primary modelling platform, MS Excel was used for preliminary statistical data exploration, and logistic regression was used for prediction.

Limitations and Challenges

The dataset contained 51 predictor variables, many missing values that needed treatment, and a severe class imbalance. This required feature selection and artificial treatment through sampling techniques.


Understanding the Data

The proposed model in this assignment consists of data pre-processing, treatment of outliers, feature selection, sampling, and the application of a machine-learning-based regression algorithm.

Data Description

The dataset contains historical financial data of organisations and includes 4256 observations, with 50 variables including the response variable. The dataset is not labelled at the outset; however, it is implied that a company with Net Worth Next Year greater than 0 is a non-default, with label 0 referring to healthy companies and 1 referring to companies with impending financial risk. The structure of the data is as below. The data was split at the outset into a raw training dataset (3541 observations) and a raw validation dataset (715 observations).

Table 1: Structure of Raw Dataset

| Variable Name | Description | Expected Sign with Y | Variable Type | Data Type |
|---|---|---|---|---|
| Net worth Next Year | Response or Y variable | - | Response | Numeric |
| Total assets | Size | Neg | Predictor | Numeric |
| Net worth | Size | Neg | Predictor | Numeric |
| Total income | Profitability | Neg | Predictor | Numeric |
| Change in stock | Not clear - could be change in stock price / inventory; recommended not to be used | - | Predictor | Numeric |
| Total expenses | Cost | Pos | Predictor | Numeric |
| Profit after tax | Profitability | Neg | Predictor | Numeric |
| PBDITA | Profitability | Neg | Predictor | Numeric |
| PBT | Profitability | Neg | Predictor | Numeric |
| Cash profit | Profitability | Neg | Predictor | Numeric |
| PBDITA as % of total income | Profitability - Ratio | Neg | Predictor | Numeric |
| PBT as % of total income | Profitability - Ratio | Neg | Predictor | Numeric |
| PAT as % of total income | Profitability - Ratio | Neg | Predictor | Numeric |
| Cash profit as % of total income | Profitability - Ratio | Neg | Predictor | Numeric |
| PAT as % of net worth | Profitability - Ratio | Neg | Predictor | Numeric |
| Sales | Size Measure | Neg | Predictor | Numeric |
| Income from financial services | Profitability | Neg | Predictor | Numeric |
| Other income | Profitability | Neg | Predictor | Numeric |
| Total capital | Size | Neg | Predictor | Numeric |
| Reserves and funds | Profit/Liquidity | Neg | Predictor | Numeric |
| Deposits (accepted by commercial banks) | Liquidity | Neg | Predictor | - |
| Borrowings | Total Debt / Borrowings / Leverage | Pos | Predictor | Numeric |
| Current liabilities & provisions | Current Liability / Short Term Debt | Pos | Predictor | Numeric |
| Deferred tax liability | Liability to be cleared next year | Pos | Predictor | Numeric |
| Shareholders' funds | Equity - Size Measure | Neg | Predictor | Numeric |
| Cumulative retained profits | Equity - Size Measure | Neg | Predictor | Numeric |
| Capital employed | Capital - Size Measure | Neg | Predictor | Numeric |
| TOL/TNW | Leverage - Ratio | Pos | Predictor | Numeric |
| Total term liabilities / tangible net worth | Leverage - Ratio | Pos | Predictor | Numeric |
| Contingent liabilities / Net worth (%) | Short term Leverage - Ratio | Pos | Predictor | Numeric |
| Contingent liabilities | Debt | Pos | Predictor | Numeric |
| Net fixed assets | Size | Neg | Predictor | Numeric |
| Investments | Size | Neg | Predictor | Numeric |
| Current assets | Size - short term | Neg | Predictor | Numeric |
| Net working capital | Size | Neg | Predictor | Numeric |
| Quick ratio (times) | Liquidity | Neg | Predictor | Numeric |
| Current ratio (times) | Liquidity | Neg | Predictor | Numeric |
| Debt to equity ratio (times) | Leverage | Pos | Predictor | Numeric |
| Cash to current liabilities (times) | Liquidity | Neg | Predictor | Numeric |
| Cash to average cost of sales per day | Liquidity | Neg | Predictor | Numeric |
| Creditors turnover | Liquidity (rate at which the company pays its creditors; higher speed, lower likelihood to default) | Neg | Predictor | Numeric |
| Debtors turnover | Liquidity (how effectively the company extends credit and recovers collectibles, i.e. how efficiently the firm uses its assets; higher value, lower likelihood to default) | Neg | Predictor | Numeric |
| Finished goods turnover | Liquidity (a high ratio indicates quick goods clearance and lower likelihood to default) | Neg | Predictor | Numeric |
| WIP turnover | Liquidity | Neg | Predictor | Numeric |
| Raw material turnover | Liquidity (higher value means fewer days raw materials stay in inventory, hence lower default likelihood) | Neg | Predictor | Numeric |
| Shares outstanding | Size | Neg | Predictor | Numeric |
| Equity face value | Size | Neg | Predictor | Numeric |
| EPS | Profitability - Market Expectation | Neg | Predictor | Numeric |
| Adjusted EPS | Profitability - Market Expectation | Neg | Predictor | Numeric |
| Total liabilities | Liabilities - Debt | Neg | Predictor | Numeric |
| PE on BSE | Market Expectation of the company's prospects | Neg | Predictor | Numeric |


Table 2: Response Variable Summary in Training Dataset

| No. of Defaults (Financial Risk) | No. of Non-Defaults (Healthy) | Data Imbalance | No. of Variables |
|---|---|---|---|
| 243 | 3298 | 93.14% (non-defaults) | 50 |

Each predictor variable in the dataset was statistically analysed to understand its quality prior to the pre-processing and cleansing steps. The dataset summary indicated missing values that needed to be addressed before applying data manipulation and exploration techniques. For the statistics table, please refer to the Raw Training Data Statistics in Appendix 1: Dataset Exploratory Analysis.

Table 3: Missing Values

| Variable | Count of Missing Values | Variable | Count of Missing Values |
|---|---|---|---|
| Total income | 198 | Profit after tax | 131 |
| Change in stock | 458 | PBDITA | 131 |
| Total expenses | 139 | Net fixed assets | 118 |
| PBT | 131 | Investments | 1435 |
| Cash profit | 131 | Current assets | 66 |
| PBDITA as % of total income | 68 | Net working capital | 32 |
| PBT as % of total income | 68 | Quick ratio (times) | 93 |
| PAT as % of total income | 68 | Current ratio (times) | 93 |
| Cash profit as % of total income | 68 | Cash to current liabilities (times) | 93 |
| Sales | 259 | Cash to average cost of sales per day | 85 |
| Income from financial services | 935 | Creditors turnover | 333 |
| Other income | 1295 | Debtors turnover | 328 |
| Total capital | 4 | Finished goods turnover | 740 |
| Reserves and funds | 85 | WIP turnover | 640 |
| Borrowings | 366 | Raw material turnover | 361 |
| Current liabilities & provisions | 96 | Shares outstanding | 692 |
| Deferred tax liability | 1140 | Equity face value | 692 |
| Cumulative retained profits | 38 | PE on BSE | 2194 |
| Contingent liabilities | 1188 | | |

Pre-processing

Once the raw training dataset was read, missing value treatment was performed: the missing values for each predictor variable were replaced by the respective column median.


Outlier treatment was carried out by flooring each variable at its 1st percentile and capping it at its 99th percentile. For the outlier treatment summary, please refer to the Outliers Summary in Appendix 1: Dataset Exploratory Analysis.

Further, the predictor variables were converted to ratios, and the response variable (Default) was converted to a string. Following ratio creation and string conversion, the cleansed dataset was explored for correlations among the individual predictors.
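The median-imputation and percentile-capping steps above can be sketched in plain Python (an illustrative sketch only; the assignment itself used KNIME and MS Excel, and the column values here are made up):

```python
# Hedged sketch of the pre-processing described above: replace missing
# entries with the column median, then floor/cap at the 1st/99th percentile.
from statistics import median

def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    m = median(observed)
    return [m if v is None else v for v in values]

def percentile(sorted_vals, p):
    """Simple nearest-rank percentile over an already-sorted list (0 < p <= 100)."""
    k = int(round(p / 100 * (len(sorted_vals) - 1)))
    return sorted_vals[max(0, min(len(sorted_vals) - 1, k))]

def cap_outliers(values, lo_p=1, hi_p=99):
    """Floor at the lo_p-th percentile and cap at the hi_p-th percentile."""
    s = sorted(values)
    lo, hi = percentile(s, lo_p), percentile(s, hi_p)
    return [min(max(v, lo), hi) for v in values]

# Illustrative column with missing values and an extreme observation.
col = [None, 5.0, 7.0, 3.0, 1000.0, None, 6.0]
cleaned = cap_outliers(impute_median(col))
```

With only a handful of rows the nearest-rank 99th percentile is simply the maximum, so capping only becomes visible on realistically sized columns.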

Business Insights from Exploratory Analysis

The proportion of Default = 1 is sparse in the dataset, indicating that the given dataset is heavily skewed towards organisations with a healthy (non-default) status.

Using the ratio between the mean variable values of the default and non-default groups, together with a decision tree, it is evident that some predictor variables are the most significant for the response variable (Default). These variables are areas the business may focus on to reduce the risk of a financial crisis or to improve credit compliance.


Model and Methodology

Feature Selection

Feature selection is a proven step that makes data mining and machine learning efficient and effective. Its objectives include building simpler and more comprehensible models, improving data mining performance (predictive accuracy and comprehensibility), and preparing clean, understandable data by removing redundancy and irrelevancy (Zhao & Liu, June 2007) (Baksai, 2010). Correlation checks of the target variable against individual predictor variables were carried out to ascertain significance. A correlation matrix, as shown below, was generated. The colour range varies from dark red (strong negative correlation), through white (no correlation), to dark blue (strong positive correlation). If a correlation value for a pair of columns is not available, it is represented by an X.

Figure 1: Correlation Matrix


As observed from the graphs, many variables exhibit high multicollinearity. The Correlation Filter node in KNIME uses the model generated by a Correlation node to determine which columns are redundant (i.e. correlated) and filters them out. The method applied here is known to be a good approximation 1.

We started with 43 variables and determined the mean of each variable for the default and non-default responses. Next, a ratio was computed between the two means. A significance test was also carried out. Using this information, and based on business domain knowledge, we filtered out 22 variables.

Table 4: Selected Variables after Feature Selection


Features Selected
TOL/TNW
Contingent liabilities / Net worth (%)
Quick ratio (times)
Adjusted EPS
Net worth Next Year/Total assets
Net worth/Total assets
Total income/Total assets
Change in stock/Total assets
PBDITA/total income
PAT/net worth
Other income/Total income
Total capital/Total assets
Reserves and funds/Total assets
Cumulative retained profits/Total income
Capital employed/Total assets
Net fixed assets/Total assets
Current assets/Total assets
Creditors turnover/Total assets
Debtors turnover/Total assets
Finished goods turnover/Total assets
Raw material turnover/Total assets
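The correlation-based redundancy filtering described above can be sketched as follows. This is an illustrative approximation of what a correlation filter does, not the KNIME node itself; the column names and the 0.9 threshold are assumptions:

```python
# Sketch of correlation-based feature filtering: for each pair of columns
# whose absolute Pearson correlation exceeds a threshold, drop the second
# column of the pair (keeping the first as the representative).
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def correlation_filter(columns, threshold=0.9):
    """columns: dict of name -> list of values. Returns the names kept."""
    names = list(columns)
    dropped = set()
    for i, a in enumerate(names):
        if a in dropped:
            continue
        for b in names[i + 1:]:
            if b in dropped:
                continue
            if abs(pearson(columns[a], columns[b])) > threshold:
                dropped.add(b)  # b is redundant given a
    return [n for n in names if n not in dropped]
```

For example, a column that is an exact multiple of another would be dropped, while an uncorrelated column survives.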

Sampling

A real financial dataset of defaults is unbalanced in nature and suffers from a class imbalance problem, leading to sample selection bias. Learning from unbalanced datasets is a difficult task since most learning algorithms are not designed to cope with a large difference between the number of cases belonging to different classes (Batista, et al., 2000). There are several methods that deal with this problem, and we can distinguish between methods that operate at the data and algorithmic levels (Chawla, et al., 2004). Previous research has proposed a random minority over-sampling approach and a cluster-based under-sampling approach for selecting representative training data to improve classification accuracy for the minority class (Yen & Lee, 2009).

1 https://nodepit.com/node/org.knime.base.node.preproc.correlation.filter.CorrelationFilterNodeFactory

Table 5: Data Imbalance on Training Dataset

| Response | Count | Percentage |
|---|---|---|
| Non-Default | 3298 | 93.14% |
| Default | 243 | 6.86% |
| Total | 3541 | 100.00% |

In this assignment, we have explored the use of both Synthetic Minority Over-sampling Technique (SMOTE)
and Random Under Sampling (Equal Size sampling node in KNIME platform) technique to overcome this
problem and create a more efficient experiment.

SMOTE is an over-sampling technique that generates synthetic minority examples, rather than simply oversampling through duplication or replacement (Han, et al., 2005).

The Equal Size sampling node in KNIME platform removes rows from the input data set such that the values in
a categorical column are equally distributed. This is done to downsize the data set so that the class attributes
occur equally often in the data set. The node will remove random rows belonging to the majority classes. The
rows returned by this node will contain all records from the minority class(es) and a random sample from each
of the majority classes, whereby each sample contains as many objects as the minority class contains 2.
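The two rebalancing strategies can be sketched in minimal Python. This is an illustrative sketch under stated assumptions, not the KNIME SMOTE or Equal Size Sampling nodes: a SMOTE-style generator that interpolates between a minority row and one of its nearest neighbours, and an under-sampler that keeps all minority rows and randomly samples the majority down to the same size:

```python
# Illustrative rebalancing sketch: SMOTE-style synthesis and random
# under-sampling. Rows are lists of floats; labels are handled outside.
import random

def smote_like(minority, n_new, k=5, rng=random):
    """Generate n_new synthetic rows, each on the segment between a random
    minority row and one of its k nearest neighbours (Euclidean distance)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (r for r in minority if r is not base),
            key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random position along the segment base -> nb
        synthetic.append([a + gap * (b - a) for a, b in zip(base, nb)])
    return synthetic

def random_under_sample(majority, minority, rng=random):
    """Keep every minority row; sample the majority down to minority size."""
    return rng.sample(majority, len(minority)) + minority
```

Because synthetic rows are interpolations, they always lie inside the convex hull of the minority class, which is what distinguishes SMOTE from duplication.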

We explored modelling in multiple ways and tested them for efficiency. The training dataset was first split in an 80:20 ratio into training and test datasets respectively. The training portion was then treated for imbalance using both over-sampling (SMOTE) and Random Under Sampling (Equal Size Sampling) techniques. Please see the Data Imbalance Treatment Summary.

Table 6: Data Imbalance Treatment Summary

| Response | Count | Percentage | Count after SMOTE | Percentage | Count after RUS | Percentage |
|---|---|---|---|---|---|---|
| Non-Default | 3298 | 93.14% | 2637 | 50.00% | 216 | 52.6% |
| Default | 243 | 6.86% | 2637 | 50.00% | 195 | 47.4% |
| Total | 3541 | | 5274 | | 411 | |

2 https://nodepit.com/node/org.knime.base.node.preproc.equalsizesampling.EqualSizeSamplingNodeFactory


Machine Learning

Machine learning plays a vital role in data-driven default detection. In this assignment, we have focused on supervised learning, where the algorithm is trained on labelled examples, i.e. observations annotated with their output. In the training set we distinguish between input features and an output variable assumed to depend on the inputs. The output, or response variable, defines the class of an observation, while the input features are the set of variables that influence the output and are used to predict the value of the response variable.

As required, logistic regression was applied to classify defaults in the dataset. Logistic regression is a statistical technique that predicts the likelihood of an event using a linear combination of independent variables as a probability model. It expresses the relationship between the dependent variable and the independent variables as a concrete function that can be used in future prediction models, and it assigns a given input to a specific class (Hosmer Jr., et al., 2013).
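The model form can be illustrated with a minimal sketch, assuming plain stochastic gradient descent training (the assignment itself used KNIME's logistic regression learner; the toy data below is made up): p(default = 1 | x) = sigmoid(w · x + b).

```python
# Minimal illustrative logistic regression trained by stochastic gradient
# descent on the log-loss. Not the KNIME learner; for exposition only.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """X: list of feature rows, y: list of 0/1 labels. Returns (weights, bias)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log-loss w.r.t. the linear score
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, xi, threshold=0.5):
    """Classify a row as 1 when the predicted probability reaches the threshold."""
    return int(sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= threshold)
```

The 0.5 threshold is the default operating point; as the ROC discussion below notes, it can be moved to trade precision against recall.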

Recall, Precision and F Measure

In statistics-based machine learning, the F-measure is a well-known measure of model performance that combines Recall and Precision over the predicted and actual classes. In this assignment, since the dataset at the outset was highly imbalanced, we used multiple measures (Recall, Precision, F-measure and AUROC) to determine the validity of the model, as opposed to traditional Accuracy statistics alone, to avoid what is commonly referred to as the Accuracy Paradox: a model with a given level of accuracy may have greater predictive power than a model with higher accuracy (Afonja, 2017). Precision, Recall and F-measure offer a suitable alternative to the traditional accuracy metric and give detailed insight into the algorithm under analysis.
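These three metrics can be computed directly from confusion-matrix counts for the class of interest; a minimal sketch:

```python
# Precision, Recall and F-measure from confusion-matrix counts
# (tp/fp/fn taken with respect to one class of interest).
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0
```

Applied to the default-class counts after SMOTE in Table 7 (TP = 41, FP = 89, FN = 13), these reproduce the reported Recall 0.759, Precision 0.315 and F-measure 0.446.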

ROC Curve

The ROC curve is widely used to determine the efficiency of diagnostic methods and is a suitable way to visualize a classifier's performance in order to select an operating point, or decision threshold. The ROC curve is a representation of sensitivity and specificity on the two-dimensional plane. The larger the area under the ROC curve (AUC), the better the diagnostic method (J. A. Hanley, April 1982).
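AUROC can be sketched without tracing the curve at all, using the equivalent rank formulation: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative (ties counting half). The scores below are illustrative:

```python
# AUROC via the rank-sum (Mann-Whitney) identity: the fraction of
# positive/negative pairs in which the positive outscores the negative.
def auroc(scores, labels):
    """scores: predicted probabilities; labels: 0/1 ground truth."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why AUROC is robust to the class imbalance discussed earlier.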


Model Performance and Validation

After treating the data for the imbalance problem, logistic regression algorithm was applied to the processed
data. Please refer to Appendix 2: Model Architecture Diagram for the entire KNIME workflow and model
outputs. The Recall, Precision, F-measure, AUROC values after applying the classification algorithms is
summarised in the table below.

Table 7: Summary of Metrics for Logistic Regression

| Metric | After SMOTE: 0 | After SMOTE: 1 | After RUS: 0 | After RUS: 1 |
|---|---|---|---|---|
| True Positives | 572 | 41 | 547 | 35 |
| False Positives | 13 | 89 | 19 | 114 |
| True Negatives | 41 | 572 | 35 | 547 |
| False Negatives | 89 | 13 | 114 | 19 |
| Recall | 0.865 | 0.759 | 0.828 | 0.648 |
| Precision | 0.978 | 0.315 | 0.966 | 0.235 |
| Sensitivity | 0.865 | 0.759 | 0.828 | 0.648 |
| Specificity | 0.759 | 0.865 | 0.648 | 0.828 |
| F-measure | 0.918 | 0.446 | 0.892 | 0.345 |
| Accuracy | 85.734% | | 81.399% | |
| Cohen's kappa | 0.379 | | 0.263 | |
| Area Under Curve | 91.361% | | 84.476% | |

A common aim of every business is to maximise both precision and recall; however, the importance of each metric depends on the business goal. For a default detection algorithm, Recall is the more important metric: it is vital to catch every possible default/risk even if that means accepting some false positives. The aim is to reach the highest F-measure, but we usually reach a point beyond which we cannot improve (Jain, 2017). From the table above, Logistic Regression after synthetically over-sampling the data is the most efficient option considering all the metrics. Please see the chart below comparing performance based on Recall, Precision and F-measure. For the various ROC curves, please refer to Appendix 3: Performance of Models.


Figure 2: Performance & Validation


Conclusion and Lessons Learnt

In this assignment, we have focused on designing a framework that can give the business early signals of financial distress by means of algorithms that can deal with unbalanced and evolving data. We also demonstrated the practicality of our project by using actual data.

Since the dataset was highly imbalanced, rebalancing the classes before training a model was required to offset sample bias error. It emerged that, without control over the data distribution, it is not possible to know beforehand whether under-sampling or over-sampling will be beneficial. We explored both under-sampling and over-sampling techniques to deal with this challenge and noticed that resampling (notably over-sampling in our case) improved the performance of default prediction. The effectiveness of the proposed model is assessed by non-traditional measures such as Recall, Precision, F-measure and AUC, which have more relevance to the business context than traditional Accuracy statistics and help overcome the Accuracy Paradox. Given its relevance to the business problem, Recall was the more important measure.

Lessons Learnt

In this assignment, we have learnt some important lessons as summarised below:

• Feature selection is an important task that increases the validity and the accuracy of the model.
• Rebalancing a training set with sampling is not guaranteed to improve performance. Several factors influence the final effectiveness of sampling, and most of them cannot be controlled.
• There is no single best technique for unbalanced classification; the best choice for a given dataset can vary with the algorithm and the accuracy measure adopted. Adaptive selection strategies can be effective at rapidly indicating which technique to use.
• The choice of learning algorithm for default prediction depends on the accuracy measure to be maximised, which goes beyond the plain Accuracy statistic of the model. In this scenario, Recall and AUROC are much better choices.

In conclusion, our assignment demonstrates that machine learning algorithms help predict future financial risk and make accurate predictions, which in turn reduce losses and missed opportunities. Machine learning allows data to be processed and analysed continuously to capture subtle patterns of macro- and micro-economic behaviour that may change frequently and lie beyond human comprehension.


Bibliography

Afonja, T., 2017. Accuracy Paradox. [Online]
Available at: https://towardsdatascience.com/accuracy-paradox-897a69e2dd9b
[Accessed 16 January 2019].

Bagherzadeh-Khiabani, F. et al., March 2016. A tutorial on variable selection for clinical prediction models:
feature selection methods in data mining could improve the results. Journal of clinical epidemiology, Issue 71,
p. 76–85.

Baksai, K. E. P., 2010. Feature Selection to Detect Patterns in Supervised and Semi Supervised Scenarios,
Santiago, Chile: Pontificia Universidad Cat´olica de Chile.

Batista, G., Carvalho, A. & Monard, M., 2000. Applying one-sided selection to unbalanced datasets. Advances in Artificial Intelligence, pp. 315–325.

Berthold, M. R. et al., 2009. KNIME - the Konstanz information miner. ACM SIGKDD Explorations Newsletter,
11(1), p. 26.

Chawla, N. V., Japkowicz, N. & Kotcz, A., 2004. Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter, 6(1), pp. 1–6.

Choi, D. & Lee, K., 2017. Machine Learning based Approach to Financial Fraud Detection, Seoul: CIST, Korea
University.

Han, H., Wang, W.-Y. & Mao, B.-H., 2005. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. Volume 3644, pp. 878–887.

Hartigan, J. A. & Wong, M. A., 1979. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1), pp. 100–108.

Hosmer Jr., D. W., Lemeshow, S. & Sturdivant, R. X., 2013. Applied logistic regression, s.l.: John Wiley and Sons.

Hanley, J. A. & McNeil, B. J., April 1982. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, pp. 29–36.

Jain, Y., 2017. Precision vs Recall - Demystifying Accuracy Paradox in Machine Learning. [Online]
Available at: https://www.newgenapps.com/blog/precision-vs-recall-accuracy-paradox-machine-learning
[Accessed 16 January 2019].

Liaw, A., Wiener, M. et al., 2002. Classification and regression by randomForest. R News, 2(3), pp. 18–22.


Li, J. et al., September 2016. Feature selection: A data perspective. arXiv:1601.07996.

Powers, D. M., 2011. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. Journal of Machine Learning Technologies, 2(1), pp. 37–63.

Pozzolo, A. D., December 2015. Adaptive Machine Learning for Credit Card Fraud Detection, Brussels: Université Libre de Bruxelles.

Panigrahi, S., Kundu, A., Sural, S. & Majumdar, A. K., 2009. Credit card fraud detection: A fusion approach using Dempster–Shafer theory and Bayesian learning. Information Fusion, 10(4), pp. 354–363.

Yen, S.-J. & Lee, Y.-S., 2009. Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications, 36(3), pp. 5718–5727.

Zhang, H., 2004. The optimality of Naive Bayes. 17th International Florida Artificial Intelligence Research Society Conference, Miami, Florida, USA.

Zhao, Z. & Liu, H., 2007. Spectral feature selection for supervised and unsupervised learning. In: Proceedings of the 24th International Conference on Machine Learning. Corvallis, Oregon, USA: ACM, pp. 1151–1157.


Appendix 1: Dataset Exploratory Analysis

Table 8: Dataset Statistics

Variable | Mean | Median | Min | Max | 99th pctl | 95th pctl | 1st pctl | 5th pctl
Total assets | 3443.4 | 309.7 | 0.1 | 1176509.2 | 51658.8 | 8452.9 | 1.7 | 10.6
Net worth | 1295.9 | 102.3 | 0.0 | 613151.6 | 20920.8 | 3034.4 | 0.3 | 2.9
Total income | 4582.8 | 444.9 | 0.0 | 2442828.2 | 43480.7 | 9339.7 | 0.2 | 5.7
Change in stock | 41.5 | 1.6 | -3029.4 | 14185.5 | 675.8 | 171.7 | -271.3 | -44.2
Total expenses | 4262.9 | 407.7 | -0.1 | 2366035.3 | 38580.8 | 8769.8 | 0.2 | 3.3
Profit after tax | 277.4 | 8.8 | -3908.3 | 119439.1 | 4345.5 | 595.1 | -191.4 | -14.6
PBDITA | 578.1 | 35.4 | -440.7 | 208576.5 | 7498.2 | 1237.9 | -23.4 | -0.4
PBT | 383.8 | 12.4 | -3894.8 | 145292.6 | 5803.4 | 776.1 | -199.6 | -15.7
Cash profit | 392.1 | 18.9 | -2245.7 | 176911.8 | 5818.5 | 823.7 | -114.7 | -5.7
PBDITA as % of total income | 4.6 | 9.7 | -6400.0 | 100.0 | 81.3 | 34.6 | -64.4 | -0.6
PBT as % of total income | -17.3 | 3.3 | -21340.0 | 100.0 | 51.9 | 23.2 | -239.4 | -22.2
PAT as % of total income | -19.2 | 2.3 | -21340.0 | 150.0 | 43.8 | 18.2 | -226.0 | -21.7
Cash profit as % of total income | -8.2 | 5.6 | -15020.0 | 100.0 | 56.8 | 24.5 | -133.6 | -10.3
PAT as % of net worth | 10.3 | 7.9 | -748.7 | 2466.7 | 97.4 | 50.2 | -138.5 | -28.3
Sales | 4549.5 | 453.1 | 0.1 | 2384984.4 | 42040.9 | 9224.9 | 0.5 | 7.5
Income from financial services | 80.8 | 1.8 | 0.0 | 51938.2 | 1036.5 | 144.2 | 0.1 | 0.1
Other income | 41.4 | 1.4 | 0.0 | 42856.7 | 368.1 | 67.9 | 0.1 | 0.1
Total capital | 216.6 | 42.1 | 0.1 | 78273.2 | 2938.7 | 608.8 | 0.5 | 1.5
Reserves and funds | 1163.8 | 54.8 | -6525.9 | 625137.8 | 17697.7 | 2789.1 | -228.3 | -40.2
Borrowings | 1122.3 | 99.2 | 0.1 | 278257.3 | 17533.9 | 2948.0 | 0.2 | 1.5
Current liabilities & provisions | 940.6 | 69.4 | 0.1 | 352240.3 | 11524.5 | 2094.5 | 0.1 | 1.2
Deferred tax liability | 227.2 | 13.4 | 0.1 | 72796.6 | 3536.7 | 480.9 | 0.1 | 0.4
Shareholders funds | 1322.1 | 105.6 | 0.0 | 613151.6 | 20920.8 | 3160.0 | 0.3 | 3.0
Cumulative retained profits | 890.5 | 37.1 | -6534.3 | 390133.8 | 13090.1 | 2069.9 | -496.7 | -65.8
Capital employed | 2328.3 | 214.7 | 0.0 | 891408.9 | 34914.6 | 5988.7 | 1.1 | 7.4
TOL/TNW | 4.0 | 1.4 | -350.5 | 473.0 | 56.0 | 10.5 | 0.0 | 0.0
Total term liabilities / tangible net worth | 1.8 | 0.3 | -325.6 | 456.0 | 29.5 | 4.2 | 0.0 | 0.0
Contingent liabilities / Net worth (%) | 53.9 | 5.3 | 0.0 | 14704.3 | 773.8 | 151.0 | 0.0 | 0.0
Contingent liabilities | 932.9 | 38.0 | 0.1 | 559506.8 | 12158.3 | 1846.8 | 0.1 | 0.4
Net fixed assets | 1189.7 | 93.5 | 0.0 | 636604.6 | 17183.0 | 2811.0 | 0.3 | 2.6
Investments | 694.7 | 8.4 | 0.0 | 199978.6 | 11054.2 | 1252.4 | 0.1 | 0.1
Current assets | 1293.4 | 145.1 | 0.1 | 354815.2 | 17836.5 | 3327.1 | 0.2 | 1.9
Net working capital | 138.6 | 16.2 | -63839.0 | 85782.8 | 3688.7 | 679.0 | -1715.8 | -159.7
Quick ratio (times) | 1.4 | 0.7 | 0.0 | 341.0 | 14.6 | 3.0 | 0.0 | 0.1
Current ratio (times) | 2.1 | 1.2 | 0.0 | 505.0 | 19.0 | 4.3 | 0.0 | 0.4
Debt to equity ratio (times) | 2.8 | 0.8 | 0.0 | 456.0 | 37.1 | 6.8 | 0.0 | 0.0
Cash to current liabilities (times) | 0.5 | 0.1 | 0.0 | 165.0 | 5.9 | 1.3 | 0.0 | 0.0
Cash to average cost of sales per day | 158.4 | 8.0 | 0.0 | 128040.8 | 1277.5 | 190.5 | 0.0 | 0.2
Creditors turnover | 15.4 | 6.1 | 0.0 | 2401.0 | 135.3 | 44.5 | 0.0 | 0.0
Debtors turnover | 17.0 | 6.3 | 0.0 | 3135.2 | 202.1 | 45.5 | 0.0 | 0.0
Finished goods turnover | 87.1 | 17.3 | -0.1 | 17947.6 | 1123.0 | 202.0 | 0.7 | 2.5
WIP turnover | 27.9 | 9.8 | -0.2 | 5651.4 | 268.9 | 77.3 | 0.2 | 1.4
Raw material turnover | 19.1 | 6.4 | -2.0 | 21092.0 | 102.4 | 34.8 | 0.0 | 0.0
Shares outstanding | 22067387.5 | 4672063.0 | -2147483647.0 | 4130400545.0 | 359249712.0 | 79982549.2 | 14306.4 | 83257.0
Equity face value | -1333.7 | 10.0 | -999998.9 | 100000.0 | 1000.0 | 100.0 | 1.0 | 6.2
EPS | -220.3 | 1.4 | -843181.8 | 34522.5 | 896.1 | 87.7 | -60.3 | -4.4
Adjusted EPS | -221.5 | 1.2 | -843181.8 | 34522.5 | 896.1 | 84.2 | -60.3 | -4.4
Total liabilities | 3443.4 | 309.7 | 0.1 | 1176509.2 | 51658.8 | 8452.9 | 1.7 | 10.6
PE on BSE | 63.9 | 9.1 | -1116.6 | 51002.7 | 362.6 | 70.1 | -133.7 | -27.1
Networth Next Year | 1616.3 | 116.3 | -74265.6 | 805773.4 | 25534.1 | 3764.4 | -77.6 | -1.4

Table 9: Outlier Treatment Summary

Variable | Outlier count
Total assets | 484
Net worth | 499
Total income | 456
Change in stock | 782
Total expenses | 455
Profit after tax | 601
PBDITA | 510
PBT | 596
Cash profit | 537
PBDITA as % of total income | 301
PBT as % of total income | 469
PAT as % of total income | 509
Cash profit as % of total income | 364
PAT as % of net worth | 345
Sales | 452
Income from financial services | 588
Other income | 613
Total capital | 444
Reserves and funds | 543
Borrowings | 510
Current liabilities & provisions | 494
Deferred tax liability | 548
Shareholders funds | 495
Cumulative retained profits | 578
Capital employed | 477
TOL/TNW | 335
Total term liabilities / tangible net worth | 329
Contingent liabilities / Net worth (%) | 391
Contingent liabilities | 603
Net fixed assets | 485
Investments | 690
Current assets | 465
Net working capital | 683
Quick ratio (times) | 321
Current ratio (times) | 348
Debt to equity ratio (times) | 310
Cash to current liabilities (times) | 452
Cash to average cost of sales per day | 498
Creditors turnover | 418
Debtors turnover | 383
Finished goods turnover | 484
WIP turnover | 416
Raw material turnover | 316
Shares outstanding | 513
Equity face value | 436
EPS | 540
Adjusted EPS | 582
Total liabilities | 484
PE on BSE | 1346
Networth Next Year | 506
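Table 9 reports, per column, how many values were flagged as outliers. This appendix does not state the exact rule used, so the sketch below assumes a common choice, the Tukey fence of 1.5 × IQR beyond the quartiles, and runs it on toy data.

```python
# Illustrative sketch: counting per-column outliers as in Table 9. The
# 1.5 x IQR (Tukey) fence is an assumption (the report's rule is not shown
# in this appendix), and the sample column is toy data.
import statistics

def count_iqr_outliers(values):
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return sum(1 for v in values if v < lo or v > hi)

column = [5, 6, 7, 8, 9, 10, 11, 12, 500]  # one extreme value
n_outliers = count_iqr_outliers(column)
```

Running this over each of the 50 columns would give the counts tabulated above; flagged values are then capped or imputed rather than dropped, depending on the treatment chosen.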


Appendix 2: Model Architecture Diagram

Figure 3: Model Architecture Diagram - KNIME


Appendix 3: Performance of Models

Figure 4: Logistic Regression after SMOTE
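Figure 4 shows logistic regression performance after SMOTE oversampling. The modelling itself was done with KNIME's SMOTE node; the pure-Python sketch below only illustrates the interpolation idea behind SMOTE on toy 2-D minority points, and `smote_like` is a hypothetical helper, not a KNIME or imbalanced-learn API.

```python
# Illustrative SMOTE-style oversampling: synthesise new minority points on
# the segment between a minority point and one of its k nearest minority
# neighbours. Toy data; not the report's actual pipeline.
import random

def smote_like(minority, n_new, k=2, seed=42):
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        # k nearest minority neighbours by squared Euclidean distance
        neighbours = sorted(
            (p for p in minority if p is not a),
            key=lambda p: sum((x - y) ** 2 for x, y in zip(a, p)),
        )[:k]
        b = rng.choice(neighbours)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic

defaulters = [(1.0, 2.0), (1.2, 1.8), (0.9, 2.2)]  # toy minority class
new_points = smote_like(defaulters, n_new=4)
```

Because each synthetic point is a convex combination of two real minority points, it always lies inside the minority class's bounding region, which is what lets the classifier see a denser minority class without duplicating rows.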


Figure 5: Logistic Regression after Random Under-Sampling
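Figure 5 shows logistic regression performance after random under-sampling, the mirror image of SMOTE: instead of creating synthetic minority rows, the majority class is randomly thinned to the minority's size. The sketch below is a minimal toy version; the labels (1 = defaulter, 0 = healthy) and helper name are assumptions for illustration.

```python
# Illustrative random under-sampling: keep all minority rows and an equally
# sized random subset of majority rows. Toy data, not the report's pipeline.
import random

def undersample(rows, labels, seed=7):
    rng = random.Random(seed)
    minority = [r for r, y in zip(rows, labels) if y == 1]
    majority = [r for r, y in zip(rows, labels) if y == 0]
    kept = rng.sample(majority, k=len(minority))  # downsample majority class
    return minority + kept, [1] * len(minority) + [0] * len(kept)

rows = list(range(10))
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # 2 defaulters vs 8 healthy firms
bal_rows, bal_labels = undersample(rows, labels)
```

The trade-off against SMOTE is that under-sampling discards majority-class information, which is why the two resampling strategies are compared in Figures 4 and 5.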
