Article
Financial Fraud Detection and Prediction in Listed Companies
Using SMOTE and Machine Learning Algorithms
Zhihong Zhao 1, * and Tongyuan Bai 2
1 School of Applied Science and Civil Engineering, Beijing Institute of Technology, Zhuhai 519085, China
2 Faculty of Natural, Mathematical and Engineering Sciences, King’s College, London WC2R 2LS, UK
* Correspondence: [email protected]
Abstract: This paper proposes a new method that can identify and predict financial fraud among
listed companies based on machine learning. We collected 18,060 transactions and 363 indicators
of finance, including 362 financial variables and a class variable. Then, we eliminated 9 indicators
which were not related to financial fraud and processed the missing values. After that, we extracted
13 indicators from 353 indicators which have a big impact on financial fraud based on multiple feature
selection models and the frequency of occurrence of features in all algorithms. Then, we established
five single classification models and three ensemble models for the prediction of financial fraud
records of listed companies, including LR, RF, XGBOOST, SVM, and DT and ensemble models with a
voting classifier. Finally, we chose the optimal single model from five machine learning algorithms
and the best ensemble model among all hybrid models. Optimal model parameters were selected using the grid search method and by comparing several evaluation metrics. The results show that the accuracy of the optimal single model ranged from 97% to 99%, and that of the ensemble models was higher than 99%. This shows that the optimal ensemble
model performs well and can efficiently predict and detect fraudulent activity of companies. Thus,
a hybrid model which combines a logistic regression model with an XGBOOST model is the best
among all models. In the future, it will not only be able to predict fraudulent behavior in company management but also reduce the burden of doing so.

Keywords: financial fraud; feature selection; classification algorithms; grid search; voting

Citation: Zhao, Z.; Bai, T. Financial Fraud Detection and Prediction in Listed Companies Using SMOTE and Machine Learning Algorithms. Entropy 2022, 24, 1157. https://doi.org/10.3390/e24081157
Handoko et al. [5] found that financial difficulties, such as income stability and the management of liquidity, also have a significant impact on financial fraud.
In China, Luckin Coffee [6] inflated its sales revenue by recording false prices for its coffee. In February 2020, Muddy Waters published a report on this fraudulent behavior; the company subsequently admitted to financial fraud and was removed from the stock market on 19 May. Wanfushengke [7] entered the main board market, where its managers falsified sales data and came under investigation by the competent authorities in 2012. Finally, it was punished by the government and barred from the stock market. Zhangzi Island [8] used natural factors to conceal property losses caused by human factors. At the same time, investigators lacked the relevant information and data to identify the fraudulent acts indicated by the false data.
In this article, we set up 363 indicators, including one interest variable and 362 other
variables. The dependent variable is a class variable. When the value is 0, it means that the
company is legitimate. When the value is 1, it indicates fraudulent activity of the company.
In addition, these independent variables describe the financial state of the company. Therefore, we must select the important independent indicators of financial state that can detect financial fraud promptly. In this paper, we extracted the most relevant indicators and built classification models using machine learning (ML) algorithms, such as Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), and Extreme Gradient Boosting (XGBOOST), that can effectively identify and predict whether or not a company has committed financial data fraud. In addition, we
also proposed a voting classifier that can use two classification models at the same time.
The principal contribution of this research may be summarised as follows:
• We have comprehensively investigated the issues of fraud detection in listed compa-
nies related to imbalanced data classification in machine learning.
• The data set contains 18,060 financial records and 353 indicators. This large amount of data forms the basis of our research, thus increasing the probability of accurate prediction.
• We used the Synthetic Minority Oversampling Technique (SMOTE) to address imbal-
ance in the data set.
• We proposed a new framework for choosing indicators that have a large impact on the class of a company, using Logistic Regression (LR), Random Forest (RF), Gradient Boosting Decision Tree (GBDT), Decision Tree (DT), and Extreme Gradient Boosting (XGBOOST).
• We carried out grid search that can choose the best parameter when constructing
machine learning models, including Logistic Regression (LR), Support Vector Machine
(SVM), Random Forest (RF), and Extreme Gradient Boosting (XGBOOST).
• We implemented a voting classifier combining several ML methods to obtain more reasonable and scientific results than single classification models.
The rest of this paper is structured as follows. Related work is reviewed in Section 2. Section 3 introduces the data set and describes the research framework: how features are selected, how single and ensemble classifiers are constructed, and how their performance is evaluated. Section 4 discusses and analyses the results. Finally, Section 5 concludes the paper.
2. Related Work
Currently, a variety of factors affect the detection of financial fraud. It is necessary
for us to build feature selection models and classification models based on a number of
samples and numerous indicators, which can increase the accuracy of prediction compared
with previous work and provide meaningful suggestions for investors and authorities.
Imbalanced data sets have been a significant challenge in recent years. Mohammed et al. [9] built several machine learning models without tuning their parameters. The results showed that Random Forest, Balanced Bagging Ensemble, and Gaussian Naïve Bayes perform well. However, it is difficult to maintain similar results on massive data sets. Kaur et al. [10] presented a review of solutions for imbalanced data sets and the pros and cons of machine learning methods, including neural networks, k-nearest neighbour (KNN), and so on; SMOTE is a better method compared with under-sampling and hybrid sampling methods. Ganganwar [11] found that current research on imbalanced data sets focuses on hybrid algorithms, such as bagging and ensemble models.
In terms of feature selection models, Neumann et al. [12] suggested new approaches to feature selection based on linear and Support Vector Machine classifiers. At the same time, they applied different convex functions within a general framework for non-convex continuous optimization. Tang et al. [13] studied different methods of feature selection for
different types of features, such as character or numeric data. In terms of the method of
feature selection, researchers can use filter models, wrapper models and embedded models.
For streaming functions, they used the graft algorithm, alpha investing algorithms and
selection of online streaming functions. Guyon et al. [14] proposed evaluation criteria for
feature selection prior to building machine learning models, such as the importance of
indicators and how to establish correlation criteria. These researchers also pay attention to the type of variable when selecting features: wrappers choose features according to the performance of a classifier, filters work as a pre-processing step, and embedded methods select variables during the training process. Kohavi et al. [15]
studied the filter method and compared its pros and cons with those of the wrapper method. The results showed that the latter can improve accuracy in decision trees and naive Bayes. Omar et al. [16] studied classification with and without feature selection. The results showed higher accuracy and a reduced classifier workload when the relevant variables were selected. Coelho et al. [17] proposed a new estimator based on mutual information that can process both continuous and discrete variables simultaneously; it is often used for pattern recognition and feature selection.
In terms of classification models, Bell et al. [18] studied companies in the fields of finance and technology and used logistic regression models for analysis. It was shown that income fell as the risk of the company's financial data fraud increased substantially, and that if the rate of increase was too high, there was a high risk of fraud. Spathis et al. [19] studied
10 indicators of funding of 76 enterprises and used a multi-criteria analysis method to
evaluate the performance and a multivariate statistical method for analysis. The results
showed that indicator selection plays an important role in detecting the class of companies.
Kirkos et al. [20] selected 10 factors from 76 Greek manufacturing companies and applied Decision Tree, Neural Network, and Bayesian Network models in their experiments. The results showed that the accuracy of the Bayesian network model was 90.3%, meaning it can predict whether a company has fraudulent financial data more accurately than the other models. Skousen et al. [21] selected
companies which were punished by the US Securities and Exchange Commission from
1992 to 2001 as the research object and built the logistic regression model based on the fraud
triangle theory. It showed that there was a relationship between the demand for cash and
financial fraud. Ravisankar et al. [22] used data mining techniques such as Multilayer Feed
Forward Neural Network (MLFF) and Support Vector Machines (SVM) to identify fraudulent financial statements. Glancy et al. [23] proposed a computational fraud detection model
(CFDM) for detecting fraud in financial reporting.
In summary, researchers have compared different algorithms to address the issue of imbalanced data sets and detect fraud. However, the studies mentioned above did not tune the parameters of their machine learning methods, which play an important role in model performance. At the same time, most research focused only on accuracy as the main performance metric instead of considering several different metrics. Therefore, we need to build various classification models based on large samples and many indicators in the data, and we evaluated these models with several performance metrics. With a neural network, it is quite difficult to explain the meaning of each indicator, and its running time is longer than that of classical machine learning models. We therefore proposed a Voting approach that combines two different algorithms. This paper selected the model with the best performance for identifying and predicting financial fraud in listed companies, which is beneficial for overcoming these shortcomings.
3. Research Methodology
3.1. Framework
This paper divides listed companies into two types: fraudulent companies and legitimate companies. We used machine learning methods to select the financial indicators of listed companies, build machine learning prediction models, and evaluate the results according to performance metrics. The process for this paper is divided into five phases. Firstly, we processed the data, handling missing values and outliers. Then, we used Logistic Regression (LR), Random Forest (RF), Gradient Boosting Decision Tree (GBDT), Decision Tree (DT), and Extreme Gradient Boosting (XGBOOST) models to select the top 20% of features after preprocessing and building the models. In addition, we counted the frequency with which each indicator was selected; when the frequency was higher than four, the indicator was chosen. After selection, we applied the Synthetic Minority Over-sampling Technique (SMOTE) to balance the class distribution of samples in the training and testing data sets. In the fourth step, we used Logistic Regression (LR), Random Forest (RF), Extreme Gradient Boosting (XGBOOST), Support Vector Machine (SVM), and Decision Tree (DT) to predict whether a company has financial fraud problems or not. Then, we used grid search to choose the best parameters of each model and evaluated these models with four performance metrics: accuracy, recall, precision, and area under the curve (AUC). Finally, we chose the best machine learning model from the four models and combined it with the others using majority voting to achieve better performance. The overall framework of the proposed intelligent approach for fraud detection is shown in Figure 1.
Figure 1. The overall framework of the proposed intelligent approach for fraud detection.
$$x_{ij}^{*} = \frac{x_{ij} - \bar{x}_j}{s_j}, \qquad i = 1, 2, \ldots, 18{,}060,\ \ j = 1, 2, \ldots, 240 \tag{1}$$
where $\bar{x}_j$ and $s_j$ denote the mean and the standard deviation of the $j$th indicator, respectively.
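As a minimal illustration (not the authors' released code), the standardization in Equation (1) can be carried out with scikit-learn's StandardScaler; the file name below is a placeholder for the preprocessed indicator matrix.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical file name; the retained numeric indicators after preprocessing.
X = pd.read_csv("financial_indicators.csv")

# StandardScaler subtracts each column mean and divides by the column standard
# deviation, matching Equation (1): x*_ij = (x_ij - mean_j) / s_j.
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
```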
According to Figure 2, we used five different methods to select features from the data set. Here we explain the details of the selection standard. For the Logistic Regression and Extreme Gradient Boosting models, we trained the model and sorted the independent variables by their weights. In addition, we sorted the features by the Gini index in the Random Forest, Gradient Boosting Decision Tree, and Decision Tree models. The equation is shown in (2)
$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^2 \tag{2}$$
In Equation (2), $K$ denotes the number of classes and $p_k$ is the proportion of the $k$th class. A higher Gini value means that the system or model is less pure and less stable. At the same time, these three models are tree models, so the Gini index can select features effectively. In addition, we also experimented with selection by information entropy instead of the Gini index. The main disadvantage of information entropy is that its calculation is slower; thus, the Gini index is more suitable when we need to build tree models.
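The selection step can be sketched as follows, assuming X_scaled and the class labels y come from the preprocessing step, and assuming the frequency rule means an indicator is kept when at least four of the five models rank it in their top 20%; the helper function and threshold names are our own, not the authors' code.

```python
import numpy as np
from collections import Counter
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

def top_features(model, feature_names, frac=0.2):
    """Return the top `frac` of features ranked by the model's own importance:
    absolute coefficients for LR, impurity-based importances for tree models."""
    importance = (np.abs(model.coef_).ravel()
                  if hasattr(model, "coef_") else model.feature_importances_)
    k = max(1, int(len(feature_names) * frac))
    idx = np.argsort(importance)[::-1][:k]
    return [feature_names[i] for i in idx]

models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(random_state=0),
    GradientBoostingClassifier(random_state=0),
    DecisionTreeClassifier(random_state=0),
    XGBClassifier(eval_metric="logloss"),
]

counts = Counter()
for m in models:
    m.fit(X_scaled, y)                        # X_scaled, y from the preprocessing step
    counts.update(top_features(m, list(X_scaled.columns)))

# Keep indicators selected by at least four of the five models (our reading of
# the frequency threshold described above).
selected = [name for name, c in counts.items() if c >= 4]
```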
SMOTE multiplies the difference between a minority-class sample and one of its nearest neighbours by a random number between 0 and 1 and adds the result to the feature vector, identifying a new point on the line segment between them. The equation is shown in (3) [25].
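A minimal sketch of this oversampling step with the imbalanced-learn library, assuming X_train and y_train hold the selected indicators and class labels of the training split; the class counts in the comments are illustrative.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

# SMOTE interpolates between a minority sample and one of its k nearest
# minority neighbours: x_new = x_i + rand(0, 1) * (x_neighbour - x_i).
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)

print("before:", Counter(y_train))   # e.g. {0: 12526, 1: 116}
print("after: ", Counter(y_res))     # the two classes are now balanced
```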
In Equation (4), p is the predicted value and q is the true value; when p gets close to q, the loss function reaches its minimum.
The random forest model [28] is an ensemble learning model that can analyze data with numerous features. In general, the algorithm selects a random subset of features when training each tree. At the same time, it saves time and is easily implemented compared with other methods. In addition, the algorithm also introduces randomness, which can effectively avoid the over-fitting phenomenon.
The extreme gradient boosting model [29] was proposed by Chen and Guestrin, which
is a kind of gradient boosting algorithm. It breaks the computational limitations of ensemble
models by accumulating iterations. It uses the cumulative sum of the predicted values of
the samples in each tree as the predicted values of the samples in the system. Comparisons of prediction results across many research areas show that the XGBOOST model performs better than other models.
The support vector machine [30] is a supervised ML method used for binary classification and regression problems. The algorithm is very effective for data with many
features. Currently, there are two ways to solve classification problems by carrying out
SVM. The first one is to construct several binary classifiers and combine them together.
Another is to directly consider the parameter optimization of all classifiers simultaneously.
This can effectively avoid the neural network structure and local minima problems and
make progress in terms of performance. We set the best parameters based on the hinge loss function. The equation is shown in (5).
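For reference, the standard hinge loss takes the following form, which we assume is what Equation (5) denotes, with $y_i \in \{-1, +1\}$ the true label and $f(x_i)$ the decision value of sample $x_i$:
$$\ell\bigl(y_i, f(x_i)\bigr) = \max\bigl(0,\ 1 - y_i f(x_i)\bigr)$$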
The decision tree [31] is a non-parametric supervised learning method for classification
and regression. It is a collection of nodes that make decisions on certain features and connect them to certain classes. The purpose of the DT is to create a model that predicts the value of a target variable by learning simple decision rules from indicators. DT is also the building block of RF and XGBOOST, each of whose trees is a DT. We built the tree using information entropy.
The equation is shown in (6).
$$\mathrm{Ent}(D) = -\sum_{k=1}^{2} p_k \log_2 p_k \tag{6}$$
In Equation (6), $k$ indexes the classes and $p_k$ is the proportion of the $k$th class. When Ent(D) is lower, the purity of D is higher.
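To make Equations (2) and (6) concrete, the two impurity measures can be computed for a vector of class labels as in the following sketch (our own helper functions, for illustration only).

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2, as in Equation (2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Information entropy: -sum_k p_k log2 p_k, as in Equation (6)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

y_example = np.array([0] * 90 + [1] * 10)   # a fairly pure node
print(gini(y_example), entropy(y_example))  # low impurity: 0.18 and about 0.47
```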
Table 3. The point-biserial correlation coefficients between the indicators and the dependent variable.

Point-biserial             x1      x2      x3      x4      x5      x6      x7
correlation coefficient
y                          0.908   0.839   0.491   0.665   0.829   0.989   0.552

                           x8      x9      x10     x11     x12     x13
y                          0.757   0.432   0.769   0.605   0.116   0.737
The training data set contains 12,642 records, including 12,526 legitimate records and 116 fraudulent records. The testing data set contains 5418 records, including 5354 legitimate records and 64 fraudulent records. After processing with the SMOTE method, the new training and testing data sets are shown in Table 4.
Before training the models, we applied the five models to both the imbalanced and the balanced data sets; details are shown in Table 4. It is essential to show the importance of processing the data set with SMOTE sampling. Among the performance metrics (accuracy, recall, precision, and AUC), AUC plays an important role in assessing the fit of the different algorithms; the higher its value, the better the model. Thus, we calculated the AUC for the same models on the two different data sets. The results are shown in Table 5.
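A sketch of this comparison is shown below, assuming clf is any of the five classifiers (with predict_proba available) and that, for simplicity, SMOTE is applied only to the training split rather than to both splits as in our pipeline.

```python
from imblearn.over_sampling import SMOTE
from sklearn.metrics import roc_auc_score

def auc_with_and_without_smote(clf, X_train, y_train, X_test, y_test):
    """Fit the same classifier on the imbalanced and on the SMOTE-balanced
    training set, and return the test-set AUC for both (cf. Table 5)."""
    clf.fit(X_train, y_train)
    auc_imbalanced = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

    X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
    clf.fit(X_res, y_res)
    auc_balanced = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    return auc_imbalanced, auc_balanced
```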
Table 5. The AUC of the same models on the two different data sets.
From Table 5, it is clear that the value of AUC in the balanced data set when we applied
SMOTE to the original data set is higher than that in the imbalanced data set. In other
words, SMOTE is a good way to increase the accuracy when we need to predict and detect
companies which have committed financial fraud. At the same time, we reviewed previous
works on deep learning [41–43]. According to the results of these studies, machine learning
has better performance than deep learning in fraud detection. There are three reasons why
we did not conduct this research using deep learning.
• Deep learning is more effective for text, video, and image processing than for this kind of tabular classification.
• The importance of each indicator is not clear, because neural networks focus on the model structure, making it difficult to obtain interpretable information about the state of a company's finances.
• When we conduct deep learning, we cannot know how to predict according to pa-
rameters and which indicators are essential for management because it is not easy to
explain the connections between neurons.
Thus, we implemented five machine learning algorithms in a balanced data set, such
as LR, RF, XGBOOST, SVM, and DT. In this phase, we implemented grid search which can
increase the accuracy of prediction and obtain results in a short time. Generally, different
models can choose different parameters, as shown in Table 1. The best parameters of each
algorithm after training are shown in Table 6.
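A minimal grid-search sketch for the XGBOOST model is given below; the parameter grid is illustrative only, not the exact candidate values listed in Table 1, and X_res, y_res are the SMOTE-balanced training data from the previous step.

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Illustrative grid only; the exact candidate values are listed in Table 1.
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1, 0.2],
}

search = GridSearchCV(
    estimator=XGBClassifier(eval_metric="logloss"),
    param_grid=param_grid,
    scoring="roc_auc",      # AUC as the model-selection criterion
    cv=5,
    n_jobs=-1,
)
search.fit(X_res, y_res)    # SMOTE-balanced training data
print(search.best_params_, search.best_score_)
```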
Based on these results, we used these best models to predict the type of company on the testing data set. These results are shown in Table 7.
Table 7. Prediction results of the five single models on the testing data set.

Algorithm     TP      FP      FN     TN
LR            3590    1773    15     40
RF            5362    1       55     0
XGBOOST       5358    5       55     0
SVM           5289    74      55     0
DT            5323    40      54     1
In Table 7, RF is the best model which can accurately predict legitimate company
behaviour. In addition, XGBOOST and DT have better performance compared with the
other models. However, these models also make some errors in predicting company
legitimacy. By contrast, SVM and LR made multiple errors in predicting legitimate or
fraudulent company. We also calculated the accuracy, recall, precision, and AUC, to
effectively evaluate and choose the best model. These results are shown in Table 8.
In Table 8, RF, XGBOOST, SVM, and DT have better performance in accuracy, recall, and precision compared with LR. In terms of accuracy, all models score over 97%, except the LR model, whose accuracy was below 70%. The recall of all models is about 99%, and the precision of the models other than LR is in the range of 98% to 100%. In conclusion, these models can predict and detect the type of company effectively. However, the AUC values of the five models ranged from 0.50 to around 0.75. We plotted the AUC values as a bar chart (Figure 4) to observe them and to select the best model as the basic model for Voting in the next step.
As for the AUC of the models, a value close to 1 means that the model fits well. It is clear that the AUC values of the logistic regression and XGBOOST models are the highest, between 0.7 and around 0.75, representing a comparatively good fit, whereas that of DT is the lowest at around 0.5. Considering these metrics, we decided to choose XGBOOST as the basic model. Then, we combined XGBOOST through Voting with LR, RF, SVM, and DT and performed the same procedure as in the first step. The results are shown in Table 9.
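A sketch of the ensemble step is shown below; it pairs logistic regression with XGBOOST under scikit-learn's VotingClassifier. We show soft voting (averaged probabilities) since a hard majority vote over only two estimators resolves ties by class order; the estimator settings are placeholders for the tuned parameters in Table 6, and the training/testing splits come from the earlier steps.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from xgboost import XGBClassifier

# Placeholder estimators; in practice reuse the grid-search winners from Table 6.
lr = LogisticRegression(max_iter=1000)
xgb = XGBClassifier(eval_metric="logloss")

ensemble = VotingClassifier(
    estimators=[("lr", lr), ("xgb", xgb)],
    voting="soft",          # average the predicted probabilities of both models
)
ensemble.fit(X_res, y_res)  # SMOTE-balanced training data

y_pred = ensemble.predict(X_test)
y_prob = ensemble.predict_proba(X_test)[:, 1]
print("accuracy :", accuracy_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("AUC      :", roc_auc_score(y_test, y_prob))
```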
In Table 9, it can be seen that all performance metrics improved in accuracy, recall, precision, and AUC. For accuracy, SVM + XGBOOST is the lowest at 97.933%, and the other three models are higher than 98%. In addition, the recall of all models is approximately 99%. In other words, the four ensemble models have similar performance, even though there are some differences among their individual metrics.
However, it is obvious that the AUC of the ML models with Voting improved dramatically; we plotted a bar chart in which this can be clearly observed, shown in Figure 5.
Figure 5. The AUC value of different single machine learning algorithms and ensemble algorithms.
For performance comparison, the AUC of the ensemble models is higher than that of the single models; for example, the AUC of the models with Voting is over 0.7. The maximum AUC difference between the two is 0.268 and the average gap is about 0.167. This means that the ensemble models can predict the class of a company more accurately and effectively.
Therefore, we used the AUC value as the main metric to choose the optimal model
due to the ensemble models performing similarly in terms of accuracy, recall, and precision.
We also plotted the ROC curve as shown in Figure 6.
Figure 6 shows that LR+XGBOOST is the best model. Its accuracy, recall, and precision are all nearly 100%. This indicates that the proposed approach is valid and reliable.
Next, a comparison analysis was conducted between the algorithms proposed in the
imbalanced data set and previous works, such as k-nearest neighbor (KNN), Multilayer
Perceptron (MLP), and Naive Bayes (NB). Other machine learning models built included
LR, RF, XGBoost, SVM, and DT. The results are shown in Table 10.
5. Conclusions
This paper proposed an intelligent model to predict and detect the type of company
based on the number of samples and various indicators. We carried out wrapper methods
for feature selection, including LR, RF, GBDT, DT, and XGBOOST. Next, these indicators
were selected by their frequency. In order to classify the type of company, we also built
single ML models and ensemble algorithms using Voting. After building them, we chose XGBOOST as the basic model and combined it with the others for training. Finally, we evaluated the models with several performance metrics: accuracy, recall, precision, and AUC.
We selected 13 indicators by using machine learning models and counting the fre-
quency of indicators. These indicators were divided into three categories, including the
cash flow, operating capacity, and profitability. Cash flow contained six indicators, such
as PUR_FIX_ASSETS_OTH, NOPERATE_EXP, C_PAID_TO_FOR_EMPL, INVENTORIES,
BIZ_TAX_SURCHFG, and N_CF_OPA_A which are the essential indicators that can eval-
uate the economic efficiency of projects. During manufacturing or working, managers
should pay attention to the outflow and income of cash instead of the size and number
of projects. Operating capacity included three indicators, such as CIP, MINORITY_INT,
and ASSETS_DISP_GAIN, indicating the ability of the business to increase income by
assets. When these indicators are higher, it indicates that the business is operating better.
Profitability contained RETAINED_EARNINGS, ADVANCE_RECEIPTS, BASIC_EPS, and
COMPR_INC_ATTR_M_S, which indicate the earning capacity. These can reflect not only
the operation of the company directly but also whether the capital structure of the company
is reasonable. Thus, these have a big impact on different phases of management of a
company. At the same time, we set the best parameter of each ML model. The ensemble
classifier using Voting showed significant improvement compared with the single model.
The best model was the LR+XGBOOST model, whose accuracy, recall, and precision were 98.523%, 99.017%, and 99.497%, respectively. Its AUC reached the highest value at 0.794. It
means that this ensemble model can predict whether companies have committed financial
fraud efficiently and more accurately compared with others.
In conclusion, we designed a scientific method to detect financial fraud, and the chosen parameters of the machine learning algorithms are reasonable according to the results. This means that the method can identify companies with financial problems with the best accuracy. In addition, it is quite useful for those working in the finance sector to address doubts regarding financial reports or projects.
At present, the number of listed companies increases year by year due to the rapid
development of the economy. Machine learning can greatly reduce the work pressure
of staff in identifying whether there are financial fraud problems in listed companies.
However, many issues still require further in-depth research as society continues to develop. There are three points we need to follow.
• In terms of collecting data, we ought to collect as much data as possible.
• In terms of model application, the financial fraud identification and prediction model can be used not only in the stock market but also in company operations, for oversight of the company's financial status at any time and anywhere.
• In terms of model validation, we plan to collect financial indicators of other listed companies in various industries and apply the model to them, validating the model's effect and ensuring its reliability by repeating the experiment.
Author Contributions: Conceptualization, Z.Z. and T.B.; methodology, Z.Z.; software, Z.Z.; vali-
dation, T.B.; formal analysis, T.B.; writing—original draft preparation, Z.Z.; writing—review and
editing, Z.Z. and T.B.; visualization, T.B.; funding acquisition, Z.Z. All authors have read and agreed
to the published version of the manuscript.
Funding: This project was supported by Special Funding Project for the Science and Technology
Innovation Cultivation of Guangdong University Students, pdjh2020a0748.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not Applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Reurink, A. Financial fraud: A literature review. J. Econ. Surv. 2018, 32, 1292–1325. [CrossRef]
2. Restya, W.P.D. Corrupt behavior in a psychological perspective. Asia Pac. Fraud. J. 2019, 4, 177–182. [CrossRef]
3. Treadway, J.C.; Thompson, G.; Woolworth, F.W. Comment Letters to the National Commission on Fraudulent Financial Reporting; Treadway Commission: New York, NY, USA, 1987; Volume 2.
4. Li, R.H. A study for establishing a fraud audit. Audit. Econ. Res. 2002, 17, 31–35.
5. Handoko, B.L.; Warganegara, D.L.; Ariyanto, S. The impact of financial distress, stability, and liquidity on the likelihood of
financial statement fraud. Palarch’s J. Archaeol. Egypt/Egyptology 2020, 17, 2383–2394.
6. Peng, Z. A Ripple in the Muddy Waters: The Luckin Coffee Scandal and Short Selling Attacks. Available online: https:
//ssrn.com/abstract=3672971 (accessed on 1 August 2020).
7. Li, Y. Research on the Effectiveness of China’s A-share Main Board Market. In E3S Web of Conferences; EDP Sciences: Les Ulis,
France, 2021; Volume 235.
8. Zhu, X.; Ao, X.; Qin, Z.; Chang, Y.; Liu, Y.; He, Q.; Li, J. Intelligent financial fraud detection practices in post-pandemic era.
Innovation 2021, 2, 100176. [CrossRef]
9. Mohammed, R.A.; Wong, K.W.; Shiratuddin, M.F.; Wang, X. Scalable machine learning techniques for highly imbalanced credit
card fraud detection: A comparative study. In Pacific Rim International Conference on Artificial Intelligence; Springer: Cham,
Switzerland, 2018; pp. 237–246.
10. Kaur, H.; Pannu, H.S.; Malhi, A.K. A systematic review on imbalanced data challenges in machine learning: Applications and
solutions. Acm Comput. Surv. (CSUR) 2020, 52, 1–36. [CrossRef]
11. Ganganwar, V. An overview of classification algorithms for imbalanced data set. Int. J. Emerg. Technol. Adv. Eng. 2012, 2, 42–47.
12. Neumann, J.; Schnörr, C.; Steidl, G. Combined SVM-based feature selection and classification. Mach. Learn. 2005, 61, 129–150.
[CrossRef]
13. Tang, J.; Alelyani, S.; Liu, H. Feature selection for classification: A review. Data Classif. Algorithms Appl. 2014, 37–64.
14. Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003, 3, 1157–1182.
15. Kohavi, R.; John, G.H. Wrappers for feature subset selection. Artif. Intell. 1997, 97, 273–324. [CrossRef]
16. Omar, N.; Jusoh, F.; Ibrahim, R.; Othman, M.S. Review of feature selection for solving classification problems. J. Inf. Syst. Res.
Innov. 2013, 3, 64–70.
17. Coelho, F.; Braga, A.P.; Verleysen, M. A mutual information estimator for continuous and discrete variables applied to feature
selection and classification problem. Int. J. Comput. Intell. Syst. 2016, 9, 726–733. [CrossRef]
18. Bell, T.B.; Carcello, J.V. A Decision Aid for Assessing the Likelihood of Fraudulent Financial Reporting. Audit. J. Pract. Theory
2000, 19, 169–184. [CrossRef]
19. Spathis, C.T. Detecting False Financial Statements Using Published Data: Some Evidence from Greece. Manag. Audit. J. 2002, 17,
179–191. [CrossRef]
20. Kirkos, E.; Spathis, C.; Manolopoulos, Y. Data mining techniques for the detection of fraudulent financial statements. Expert Syst.
Appl. 2007, 32, 995–1003. [CrossRef]
21. Skousen, C.J.; Smith, K.R.; Wright, C.J. Detecting and Predicting Financial Statement Fraud: The Effectiveness of the Fraud
Triangle and SAS No. 99. Soc. Sci. Electron. Publ. 2008, 13, 53–81. [CrossRef]
22. Ravisankar, P.; Ravi, V.; Rao, G.R.; Bose, I. Detection of financial statement fraud and feature selection using data mining
techniques. Decis. Support Syst. 2011, 50, 491–500. [CrossRef]
23. Glancy, F.H.; Yadav, S.B. A computational model for financial reporting fraud detection. Decis. Support Syst. 2011, 50, 595–601.
[CrossRef]
24. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell.
Res. 2002, 16, 321–357. [CrossRef]
25. Abdoh, S.F.; Rizka, M.A.; Maghraby, F.A. Cervical Cancer Diagnosis Using Random Forest Classifier with SMOTE and Feature
Reduction Techniques. IEEE Access 2018, 6, 59475–59485. [CrossRef]
26. Ileberi, E.; Sun, Y.; Wang, Z. Performance Evaluation of Machine Learning Methods for Credit Card Fraud Detection Using
SMOTE and AdaBoost. IEEE Access 2021, 6, 165286–165294. [CrossRef]
27. Dreiseitl, S.; Ohno-Machado, L. Logistic regression and artificial neural network classification models: A methodology review. J.
Biomed. Inform. 2002, 35, 352–359. [CrossRef]
28. Speiser, J.L.; Miller, M.E.; Tooze, J.; Ip, E. A comparison of random forest variable selection methods for classification prediction
modeling. Expert Syst. Appl. 2019, 134, 93–101. [CrossRef]
29. Ramraj, S.; Uzir, N.; Sunil, R.; Banerjee, S. Experimenting XGBOOST algorithm for prediction and classification of different data
sets. Int. J. Control. Theory Appl. 2016, 9, 651–662.
30. Bhavsar, H.; Panchal, M.H. A review on support vector machine for data classification. Int. J. Adv. Res. Comput. Eng. Technol.
2012, 11, 185–189.
31. Song, Y.Y.; Ying, L.U. Decision tree methods: Applications for classification and prediction. Shanghai Arch. Psychiatry 2015, 27, 130.
32. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.;
et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
33. LogisticRegression. Available online: https://scikit-learn.org/stable/modules/classes.html (accessed on 29 March 2022).
34. RandomForestClassifier. Available online: https://scikit-learn.org/stable/supervised_learning.html (accessed on 29 March
2022).
35. SVC. Available online: https://scikit-learn.org/stable/supervised_learning.html (accessed on 29 March 2022).
36. DecisionTreeClassifier. Available online: https://scikit-learn.org/stable/supervised_learning.html (accessed on 29 March 2022).
37. Kabari, L.G.; Onwuka, U.C. Comparison of bagging and voting ensemble machine learning algorithm as a classifier. Int. J. Adv.
Res. Comput. Sci. Softw. Eng. 2019, 9, 19–23.
38. Randhawa, K.; Loo, C.K.; Seera, M.; Lim, C.P.; Nandi, A.K. Credit card fraud detection using AdaBoost and majority voting. IEEE
Access 2018, 6, 14277–14284. [CrossRef]
39. Taha, A.A.; Malebary, S.J. An intelligent approach to credit card fraud detection using an optimized light gradient boosting
machine. IEEE Access 2020, 8, 25579–25587. [CrossRef]
40. Khamis, H. Measures of association: How to choose? J. Diagn. Med. Sonogr. 2008, 24, 155–162. [CrossRef]
41. Mehbodniya, A.; Alam, I.; Pande, S.; Neware, R.; Rane, K.P.; Shabaz, M.; Madhavan, M.V. Financial fraud detection in healthcare
using machine learning and deep learning techniques. Secur. Commun. Netw. 2021, 2021, 9293877. [CrossRef]
42. Gupta, R.Y.; Mudigonda, S.S.; Baruah, P.K. A comparative study of using various machine learning and deep learning-based
fraud detection models for universal health coverage schemes. Int. J. Eng. Trends Technol. 2021, 69, 96–102. [CrossRef]
43. Mathew, A.; Amudha, P.; Sivakumari, S. Deep learning techniques: An overview. In International Conference on Advanced Machine Learning Technologies and Applications; Springer: Singapore, 2020; pp. 599–608.