Financial Risk Analysis
Group Assignment
Great Learning PGPBABI 2017
Contents

Team
Abbreviations
Introduction
Data Source
Data Description
Pre-processing
Feature Selection
Sampling
Machine Learning
ROC Curve
Lessons Learnt
Bibliography
Team
Abbreviations
ML: Machine Learning
TP: True Positive
TN: True Negative
FP: False Positive
FN: False Negative
Introduction
This assignment presents a financial distress prediction model that uses logistic regression to classify firms as healthy or financially distressed. Past research has largely ignored the precision of the minority class and focused on the overall accuracy percentage. We present a model linking microeconomic variables directly to an aggregate measure of credit risk in organisations; the suggested method serves as an early warning system for firms heading into financial distress.
Keywords: Machine Learning, Feature Selection, Predictive Modelling, Logistic Regression, Recall, F-measure,
AUROC, SMOTE, KNIME, Financial Risk Analysis
The scope of this assignment is limited to analysing the available data and predicting a classification outcome for the categorical dependent variable (default) using logistic regression. The objective is to better predict the likelihood of default by an organisation and to identify the key drivers of that likelihood: we explore the data provided in the dataset and use logistic regression to build a predictive model that screens distressed companies from healthy ones.
Data Source
The dataset contains historical data on Indian companies listed on the Bombay Stock Exchange. To protect personal information, key data that could identify an organisation has been anonymised.
Multiple platforms were used at various stages of the assignment life cycle. KNIME hosted the modelling workflow, MS Excel was used for preliminary statistical data exploration, and logistic regression was used for prediction.
The data contained 51 predictor variables, many missing values that needed treatment, and a pronounced class imbalance. This required feature selection and artificial rebalancing through sampling techniques.
The proposed model in this assignment consists of data pre-processing, treatment of outliers, feature selection, sampling, and the application of a machine-learning-based regression algorithm.
Data Description
The dataset holds historical financial data for organisations and includes 4,256 observations across 50 variables, including the response variable. The data is not labelled at the outset; however, it is implied that a net worth in the next year greater than 0 indicates non-default, with label 0 referring to healthy companies and label 1 to companies with impending financial risk. The structure of the data is shown below. At the outset the data was split into a raw training dataset (3,541 observations) and a raw validation dataset (715 observations).
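As an illustration of this labelling rule, a minimal pandas sketch; the file name is hypothetical and the `Networth Next Year` column name is taken from the statistics in Appendix 1:

```python
import pandas as pd

# Load the raw data; the file name is assumed for illustration.
raw = pd.read_csv("raw_company_data.csv")

# Net worth next year > 0 implies a healthy firm (label 0);
# otherwise the firm carries impending financial risk (label 1).
raw["Default"] = (raw["Networth Next Year"] <= 0).astype(int)
```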
| Variable | Category / interpretation | Correlation with default | Role | Type |
|---|---|---|---|---|
| Deferred tax liability | Liability to be cleared next year | Pos | Predictor | Numeric |
| Shareholders' funds | Equity - size measure | Neg | Predictor | Numeric |
| Cumulative retained profits | Equity - size measure | Neg | Predictor | Numeric |
| Capital employed | Capital - size measure | Neg | Predictor | Numeric |
| TOL/TNW | Leverage - ratio | Pos | Predictor | Numeric |
| Total term liabilities / tangible net worth | Leverage - ratio | Pos | Predictor | Numeric |
| Contingent liabilities / Net worth (%) | Short-term leverage - ratio | Pos | Predictor | Numeric |
| Contingent liabilities | Debt | Pos | Predictor | Numeric |
| Net fixed assets | Size | Neg | Predictor | Numeric |
| Investments | Size | Neg | Predictor | Numeric |
| Current assets | Size - short term | Neg | Predictor | Numeric |
| Net working capital | Size | Neg | Predictor | Numeric |
| Quick ratio (times) | Liquidity | Neg | Predictor | Numeric |
| Current ratio (times) | Liquidity | Neg | Predictor | Numeric |
| Debt to equity ratio (times) | Leverage | Pos | Predictor | Numeric |
| Cash to current liabilities (times) | Liquidity | Neg | Predictor | Numeric |
| Cash to average cost of sales per day | Liquidity | Neg | Predictor | Numeric |
| Creditors turnover | Liquidity (rate at which the company pays its creditors; higher speed, lower likelihood of default) | Neg | Predictor | Numeric |
| Debtors turnover | Liquidity (how effectively the company extends credit and recovers collectibles; higher value, lower likelihood of default) | Neg | Predictor | Numeric |
| Finished goods turnover | Liquidity (high ratio means quick goods clearance; lower likelihood of default) | Neg | Predictor | Numeric |
| WIP turnover | Liquidity | Neg | Predictor | Numeric |
| Raw material turnover | Liquidity (higher value means fewer days of raw materials in inventory; lower likelihood of default) | Neg | Predictor | Numeric |
| Shares outstanding | Size | Neg | Predictor | Numeric |
| Equity face value | Size | Neg | Predictor | Numeric |
| EPS | Profitability - market expectation | Neg | Predictor | Numeric |
| Adjusted EPS | Profitability - market expectation | Neg | Predictor | Numeric |
| Total liabilities | Liabilities - debt | Neg | Predictor | Numeric |
| PE on BSE | Market expectation of the company's prospects | Neg | Predictor | Numeric |
Each predictor variable in the dataset was statistically analysed to understand its quality prior to the pre-processing and cleansing steps. The dataset summary indicated missing values that needed to be addressed before applying data manipulation and exploration techniques. For the statistics table, please refer to the Raw Training Data Statistics in Appendix 1: Dataset Exploratory Analysis.
Pre-processing
Once the raw training dataset was read, missing values were treated: the missing values in each predictor variable were replaced with the respective column's median value.
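A minimal pandas sketch of this median imputation, assuming the raw training set has been loaded as a DataFrame `train` (the actual work was done with KNIME's Missing Value node):

```python
# Impute each numeric predictor's missing values with that column's median,
# computed from the raw training data itself.
numeric_cols = train.select_dtypes("number").columns
train[numeric_cols] = train[numeric_cols].fillna(train[numeric_cols].median())
```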
Outlier treatment was carried out on the dataset by taking the 1st percentile of each variable as the floor and the 99th percentile as the cap. For the outlier treatment summary, please refer to the Outliers Summary in Appendix 1: Dataset Exploratory Analysis.
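A sketch of the same capping in pandas, continuing from the previous snippet:

```python
# Winsorise: floor each predictor at its 1st percentile and cap it
# at its 99th percentile to blunt the influence of extreme outliers.
for col in numeric_cols:
    floor, cap = train[col].quantile([0.01, 0.99])
    train[col] = train[col].clip(lower=floor, upper=cap)
```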
The predictor variables were then converted to ratios, and the response variable (Default) was converted to string. Following the ratio creation and string conversion, the cleansed dataset was explored for correlations among the individual predictors.
The proportion of Default = 1 records is sparse, indicating that the dataset is heavily skewed towards organisations with a healthy (non-default) status.
Using the ratio between the mean values of each variable for default and non-default records, together with a decision tree, it is evident that some predictor variables are the most significant for the response variable (Default). These variables are areas on which the business may focus to reduce the risk of financial distress or to improve credit compliance.
Feature Selection
Feature selection is a proven step that makes data mining and machine learning efficient and effective. The objectives of feature selection include building simpler and more comprehensible models, improving data mining performance (predictive accuracy and comprehensibility), and preparing clean (redundancy and irrelevancy removed), understandable data (Zhao & Liu, June 2007) (Baksai, 2010). Correlation checks of the target variable against individual predictor variables were carried out to ascertain significance. A correlation matrix, as shown below, was generated. The colour range varies from dark red (strong negative correlation), over white (no correlation), to dark blue (strong positive correlation). If a correlation value for a pair of columns is not available, it is represented by an X.
As observed from the graphs, there is high multicollinearity among many of the variables. The Correlation Filter node in KNIME uses the model generated by a Correlation node to determine which columns are redundant (i.e. correlated) and filters them out. The method applied here is known to be a good approximation (see https://nodepit.com/node/org.knime.base.node.preproc.correlation.filter.CorrelationFilterNodeFactory).
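A rough pandas/NumPy equivalent of this redundancy filter, dropping one column from each highly correlated pair; the 0.9 threshold is an assumption, not necessarily the value used in the KNIME node:

```python
import numpy as np

# Absolute pairwise correlations; keep only the upper triangle so each
# pair is inspected once, then drop one column of every pair above 0.9.
corr = train[numeric_cols].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [c for c in upper.columns if (upper[c] > 0.9).any()]
train = train.drop(columns=redundant)
```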
We started with 43 variables and determined the mean of each variable for the default and non-default responses. Next, a ratio was computed between the two means. A significance test was also carried out. Using this information, together with business domain knowledge, we filtered out 22 variables.
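A sketch of the mean-ratio screen described above, assuming integer 0/1 labels in a `Default` column; the final 22-variable cut also drew on significance tests and domain knowledge:

```python
# Per-variable class means, then the ratio of the default-class mean to
# the non-default-class mean; ratios far from 1 flag discriminative variables.
class_means = train.groupby("Default")[numeric_cols].mean()
mean_ratio = class_means.loc[1] / class_means.loc[0]
print(mean_ratio.sort_values())
```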
Sampling
A real financial default dataset is unbalanced in nature, and this imbalance leads to sample selection bias. Learning from unbalanced datasets is a difficult task, since most learning algorithms are not designed to cope with a large difference between the number of cases belonging to different classes (Batista, et al., 2000). There are several methods that deal with this problem, and we can distinguish between methods that operate at the data and algorithmic levels (Chawla, et al., 2004). Previous research has proposed a random minority over-sampling approach and a cluster-based under-sampling approach for selecting representative training data to improve classification accuracy for the minority class (Yen & Lee, 2009).
In this assignment, we have explored the use of both the Synthetic Minority Over-sampling Technique (SMOTE) and Random Under-Sampling (the Equal Size Sampling node in the KNIME platform) to overcome this problem and create a more efficient experiment.
SMOTE is an over-sampling technique that generates synthetic minority-class examples, rather than simply oversampling through duplication or replacement (Han, et al., 2005).
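A minimal sketch using the SMOTE implementation from the imbalanced-learn package (the assignment itself used KNIME's SMOTE node; `predictor_cols` is a hypothetical list of the retained predictors):

```python
from imblearn.over_sampling import SMOTE

X_train, y_train = train[predictor_cols], train["Default"]

# Synthesises new minority-class rows by interpolating between a minority
# example and its nearest minority neighbours, instead of duplicating rows.
X_smote, y_smote = SMOTE(random_state=42).fit_resample(X_train, y_train)
```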
The Equal Size Sampling node in the KNIME platform removes rows from the input data set such that the values in a categorical column are equally distributed. This downsizes the data set so that the class attribute values occur equally often. The node removes random rows belonging to the majority classes: the rows it returns contain all records from the minority class(es) plus a random sample from each majority class, whereby each sample contains as many rows as the minority class (see https://nodepit.com/node/org.knime.base.node.preproc.equalsizesampling.EqualSizeSamplingNodeFactory).
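The imbalanced-learn counterpart of Equal Size Sampling is random under-sampling; a sketch under the same assumptions as the previous snippet:

```python
from imblearn.under_sampling import RandomUnderSampler

# Keeps every minority-class row and a random, equally sized sample
# of the majority class, mirroring the Equal Size Sampling behaviour.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)
```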
We explored modelling in multiple ways and tested the variants for efficiency. The training dataset was first split in an 80:20 ratio into a training and a test dataset respectively. The training portion was then treated for imbalance using both over-sampling (SMOTE) and random under-sampling (Equal Size Sampling). Please see the Data Imbalance Treatment Summary.
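A sketch of this split with scikit-learn; stratifying on the label (an assumption here) keeps the default rate comparable across the two partitions:

```python
from sklearn.model_selection import train_test_split

# 80:20 split; resampling (SMOTE or under-sampling) is applied only to
# the training partition, never to the held-out test partition.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_train, y_train, test_size=0.20, stratify=y_train, random_state=42
)
```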
Machine Learning
Machine Learning plays a vital role in data-driven default detection. In this assignment, we have focussed on supervised learning, where the algorithm is trained on labelled examples. Supervised learning assumes the availability of labelled samples, i.e. observations annotated with their output, which can be used to train a learner. In the training set we distinguish between input features and an output variable that is assumed to depend on the inputs. The output, or response variable, defines the class of each observation; the input features are the variables that influence the output and are used to predict the value of the response variable.
As required, a logistic regression algorithm was applied for accurate classification of defaults in the transaction dataset. Logistic regression is a statistical technique that predicts the likelihood of an event using a linear combination of independent variables as a probability model. The objective of logistic regression is to express the relationship between the dependent variable and the independent variables as a concrete function that can be used in future prediction models; given new input data, the result is then assigned to a specific class (Hosmer Jr., et al., 2013).
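A sketch of the model fit and prediction with scikit-learn, continuing the naming of the earlier snippets (a rebalanced training partition and a held-out test partition):

```python
from sklearn.linear_model import LogisticRegression

# Fit on the rebalanced training data; predict default probabilities
# on the untouched test partition.
model = LogisticRegression(max_iter=1000).fit(X_smote, y_smote)
proba = model.predict_proba(X_te)[:, 1]   # estimated likelihood of default
pred = (proba >= 0.5).astype(int)         # 0.5 is the default threshold
```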
In statistics-based machine learning, the F-measure is a well-known measure of model performance that combines Recall and Precision over the predicted and actual classes. In this assignment, since the dataset was highly imbalanced at the outset, we used multiple measures (Recall, Precision, F-measure and AUROC) to determine the validity of the model, rather than the traditional Accuracy statistic alone. This avoids what is commonly referred to as the Accuracy Paradox: the paradoxical situation in which predictive models with a given level of accuracy may have greater predictive power than models with higher accuracy (Afonja, 2017). Precision, Recall, and F-measure offer a suitable alternative to the traditional accuracy metric and provide detailed insight into the algorithm under analysis.
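These measures map directly onto scikit-learn metric functions; a sketch using `pred` and `proba` from the previous snippet:

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

print("Precision:", precision_score(y_te, pred))
print("Recall:   ", recall_score(y_te, pred))
print("F-measure:", f1_score(y_te, pred))
print("AUROC:    ", roc_auc_score(y_te, proba))  # uses scores, not hard labels
```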
ROC Curve
The ROC curve is widely used to determine the efficiency of diagnostic methods and is a suitable way to visualise a classifier's performance in order to select an operating point, or decision threshold. The ROC curve is a representation of sensitivity and specificity on the two-dimensional plane; the larger the area under the ROC curve (AUC), the better the diagnostic method (J. A. Hanley, April 1982).
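A sketch of tracing the curve to inspect candidate operating points (matplotlib assumed to be available):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Each point on the curve corresponds to one decision threshold over the scores.
fpr, tpr, thresholds = roc_curve(y_te, proba)
plt.plot(fpr, tpr, label="logistic regression")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
plt.xlabel("False positive rate (1 - specificity)")
plt.ylabel("True positive rate (sensitivity)")
plt.legend()
plt.show()
```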
After treating the data for the imbalance problem, the logistic regression algorithm was applied to the processed data. Please refer to Appendix 2: Model Architecture Diagram for the entire KNIME workflow and the model outputs. The Recall, Precision, F-measure and AUROC values after applying the classification algorithm are summarised there.
In this assignment, we have focused on the design of a framework that can offer the business early signals of financial distress by means of algorithms that can deal with unbalanced and evolving data. We also demonstrated the practicality of our approach by using actual data.
Since the data set was highly imbalanced, rebalancing the classes before training a model was required to offset sample bias error. It emerged that, without control over the data distribution, it is not possible to know beforehand whether under-sampling or over-sampling will be beneficial. We explored the use of both under-sampling and over-sampling techniques to deal with this challenge, and observed that resampling (notably over-sampling in our case) improved the performance of default prediction. The effectiveness of the proposed model is assessed by non-traditional measures such as Recall, Precision, F-measure and AUC, which are more relevant to the business context than traditional Accuracy statistics and help overcome the Accuracy Paradox. Given its relevance to the business problem, Recall was the more important measure.
Lessons Learnt
• Feature selection is an important task that increases the validity and the accuracy of the model.
• Rebalancing a training set with sampling is not guaranteed to improve performance. Several factors influence the final effectiveness of sampling, and most of them cannot be controlled.
• There is no single best technique for unbalanced classification; the best technique for a given dataset can vary with the algorithm and the accuracy measure adopted. Adaptive selection strategies can be effective in rapidly deciding which technique to use.
• The choice of learning algorithm for default prediction depends on the accuracy measure that we want to maximise, which goes beyond the plain Accuracy statistic of the model. In this scenario, Recall and AUROC are a much better choice.
In conclusion, our assignment shows that machine learning algorithms help predict future financial risk and make accurate predictions that, in turn, reduce losses and missed opportunities. Machine learning allows data to be processed and analysed continuously to uncover subtle patterns of macro- and micro-economic behaviour that may change frequently and lie beyond human comprehension.
Bibliography
Bagherzadeh-Khiabani, F. et al., March 2016. A tutorial on variable selection for clinical prediction models: feature selection methods in data mining could improve the results. Journal of Clinical Epidemiology, Issue 71, pp. 76–85.
Baksai, K. E. P., 2010. Feature Selection to Detect Patterns in Supervised and Semi-Supervised Scenarios. Santiago, Chile: Pontificia Universidad Católica de Chile.
Batista, G., Carvalho, A. & Monard, M., 2000. Applying one-sided selection to unbalanced datasets. Advances in Artificial Intelligence, pp. 315–325.
Berthold, M. R. et al., 2009. KNIME - the Konstanz Information Miner. ACM SIGKDD Explorations Newsletter, 11(1), p. 26.
Chawla, N. V., Japkowicz, N. & Kotcz, A., 2004. Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter, 6(1), pp. 1–6.
Choi, D. & Lee, K., 2017. Machine Learning based Approach to Financial Fraud Detection. Seoul: CIST, Korea University.
Han, H., Wang, W.-Y. & Mao, B.-H., 2005. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. Lecture Notes in Computer Science, Volume 3644, pp. 878–887.
Hanley, J. A. & McNeil, B. J., April 1982. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), pp. 29–36.
Hartigan, J. A. & Wong, M. A., 1979. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1), pp. 100–108.
Hosmer Jr., D. W., Lemeshow, S. & Sturdivant, R. X., 2013. Applied Logistic Regression. John Wiley and Sons.
Jain, Y., 2017. Precision vs Recall - Demystifying Accuracy Paradox in Machine Learning. [Online] Available at: https://www.newgenapps.com/blog/precision-vs-recall-accuracy-paradox-machine-learning [Accessed 16 January 2019].
Li, J. et al., September 2016. Feature selection: A data perspective. arXiv:1601.07996.
Liaw, A. & Wiener, M., 2002. Classification and regression by randomForest. R News, 2(3), pp. 18–22.
Panigrahi, S., Kundu, A., Sural, S. & Majumdar, A. K., October 2009. Credit card fraud detection: A fusion approach using Dempster–Shafer theory and Bayesian learning. Information Fusion, 10(4), pp. 354–363.
Powers, D. M., 2011. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. Journal of Machine Learning Technologies, 2(1), pp. 37–63.
Pozzolo, A. D., December 2015. Adaptive Machine Learning for Credit Card Fraud Detection. Brussels: Université Libre de Bruxelles.
Yen, S.-J. & Lee, Y.-S., 2009. Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications, 36(3), pp. 5718–5727.
Zhang, H., January 2004. The optimality of Naive Bayes. Miami, Florida, USA: 17th International Florida Artificial Intelligence Research Society Conference.
Zhao, Z. & Liu, H., June 2007. Spectral feature selection for supervised and unsupervised learning. Corvallis, Oregon, USA: ACM, pp. 1151–1157.
Raw Training Data Statistics (Appendix 1, continued):

| Variable | Mean | Median | Min | Max | 99th pct | 95th pct | 1st pct | 5th pct |
|---|---|---|---|---|---|---|---|---|
| Net working capital | 138.6 | 16.2 | -63839.0 | 85782.8 | 3688.7 | 679.0 | -1715.8 | -159.7 |
| Quick ratio (times) | 1.4 | 0.7 | 0.0 | 341.0 | 14.6 | 3.0 | 0.0 | 0.1 |
| Current ratio (times) | 2.1 | 1.2 | 0.0 | 505.0 | 19.0 | 4.3 | 0.0 | 0.4 |
| Debt to equity ratio (times) | 2.8 | 0.8 | 0.0 | 456.0 | 37.1 | 6.8 | 0.0 | 0.0 |
| Cash to current liabilities (times) | 0.5 | 0.1 | 0.0 | 165.0 | 5.9 | 1.3 | 0.0 | 0.0 |
| Cash to average cost of sales per day | 158.4 | 8.0 | 0.0 | 128040.8 | 1277.5 | 190.5 | 0.0 | 0.2 |
| Creditors turnover | 15.4 | 6.1 | 0.0 | 2401.0 | 135.3 | 44.5 | 0.0 | 0.0 |
| Debtors turnover | 17.0 | 6.3 | 0.0 | 3135.2 | 202.1 | 45.5 | 0.0 | 0.0 |
| Finished goods turnover | 87.1 | 17.3 | -0.1 | 17947.6 | 1123.0 | 202.0 | 0.7 | 2.5 |
| WIP turnover | 27.9 | 9.8 | -0.2 | 5651.4 | 268.9 | 77.3 | 0.2 | 1.4 |
| Raw material turnover | 19.1 | 6.4 | -2.0 | 21092.0 | 102.4 | 34.8 | 0.0 | 0.0 |
| Shares outstanding | 22067387.5 | 4672063.0 | -2147483647.0 | 4130400545.0 | 359249712.0 | 79982549.2 | 14306.4 | 83257.0 |
| Equity face value | -1333.7 | 10.0 | -999998.9 | 100000.0 | 1000.0 | 100.0 | 1.0 | 6.2 |
| EPS | -220.3 | 1.4 | -843181.8 | 34522.5 | 896.1 | 87.7 | -60.3 | -4.4 |
| Adjusted EPS | -221.5 | 1.2 | -843181.8 | 34522.5 | 896.1 | 84.2 | -60.3 | -4.4 |
| Total liabilities | 3443.4 | 309.7 | 0.1 | 1176509.2 | 51658.8 | 8452.9 | 1.7 | 10.6 |
| PE on BSE | 63.9 | 9.1 | -1116.6 | 51002.7 | 362.6 | 70.1 | -133.7 | -27.1 |
| Networth Next Year | 1616.3 | 116.3 | -74265.6 | 805773.4 | 25534.1 | 3764.4 | -77.6 | -1.4 |