Dissertation on Fraud Detection Using Machine Learning

BY

LUKUI HUANG

on July 6, 2023

Chairman
(Assistant Professor Richard Gruss, Ph.D.)
ABSTRACT
ACKNOWLEDGEMENTS
Lukui Huang
TABLE OF CONTENTS
Page
ABSTRACT 1
ACKNOWLEDGEMENTS 3
CHAPTER 1 INTRODUCTION 1
3.2.4 Labeling 17
3.3 Experimental results 18
3.3.1 Experimental setup 18
3.3.2 Evaluating metrics 20
3.3.3 Experimental results and analysis 22
3.4 Supplement analysis 25
3.4.1 The size effect of the training sets 25
3.4.2 Feature importance of cost-sensitive cascade forest 26
3.4.3 The effect of more financial variables 27
3.5 Conclusion 30
3.5.1 Implications for Research 31
3.5.2 Implications for Practice 31
3.5.3 Limitations 32
3.5.4 Future Work 32
REFERENCES 65
APPENDICES
APPENDIX A 75
APPENDIX B 76
APPENDIX C 77
APPENDIX D 78
APPENDIX E 79
BIOGRAPHY 81
LIST OF TABLES
Tables Page
3.1 Descriptive statistics of selected financial variables 15
3.2 Sample selection 17
3.3 Distribution of fraud firms over 1991-2014 18
3.4 Experimental results 23
3.5 Feature importance of the top-10 financial variables and the related accounts affected by the detected frauds 27
3.6 Experimental results using 187 variables 29
4.1 Prevalence of accounting control issues in random sample of employee reviews 37
4.2 Top 20 unigrams retrieved by Correlation Coefficient score 45
4.3 Experimental results 48
4.4 MARS Metric for Principle 13-15 50
4.5 MARS Metric for Principle 3 51
4.6 Ranking quality evaluation 54
LIST OF FIGURES
Figures Page
3.1 Illustration of the workflow for the main experiments 20
3.2 Effect of training size on models’ AUC with missing value treatments 26
3.3 AUC vs Time per Layer (in seconds) for various feature set sizes 30
4.1 Illustration of the workflow for the dataset 39
4.2 Accumulated true positive in top ranked reviews for Principle 13-15 55
4.3 Accumulated true positive in top ranked reviews for Principle 3 55
4.4 Web application architecture 58
4.5 Web application components 58
4.6 Main Page layout and layout for selecting control issue type 60
4.7 Layout for selecting a machine learning prediction model 61
4.8 Layout for file upload and prediction preview 62
CHAPTER 1
INTRODUCTION
CHAPTER 2
BACKGROUND AND LITERATURE REVIEW
2.1.1 Background
Unlike unintentional errors, fraud arises when firms intentionally distort their
financial statements to appear more attractive. Such fraud is deliberately obscured and
thus challenging to detect, even for professionals (Earley, 2015; Warren Jr et al.,
2015). A report by the world's largest anti-fraud organization, the Association of
Certified Fraud Examiners (ACFE), found that financial fraud caused over $3.6
billion in damage from January 2018 to September 2019. Only 15% of fraud cases
were detected by internal auditors and 4% by external auditors (ACFE, 2020). To
improve detection, auditing standards (e.g., Statement of Auditing Standards
No. 99 on auditors' responsibility for fraud detection) require more proactive auditing.
Data mining, which uncovers insights from large datasets, has advanced capabilities to
classify and predict. Given data mining's popularity and availability, accounting firms
have used it in decision support systems to provide early warnings of financial
statement fraud (Hajek & Henriques, 2017).
Prior research has investigated the utility of various data mining
techniques to aid auditors in fraud detection (Ngai et al., 2011; West & Bhattacharya,
2016). These include decision trees (Kotsiantis et al., 2006), support vector machines
(Cecchini et al., 2010b), neural networks (e.g., Fanning & Cogger, 1998; Lin et al.,
2015), evolutionary algorithms (e.g., Alden et al., 2012; Hoogs et al., 2007), ensemble
models (e.g., Bao et al., 2020; Hajek & Henriques, 2017), and text mining (e.g., Goel
& Gangolly, 2012; Goel & Uzuner, 2016; Throckmorton et al., 2015). Although some
studies used financial, linguistic, and textual data (e.g., Craja et al., 2020; Hajek &
Henriques, 2017), findings on whether linguistic or textual data provide extra
information are inconsistent and depend on the data selected. In contrast, studies
consistently show that financial data, especially when combined with advanced data
mining, provides useful information for fraud detection.
higher misclassification costs for that class. Empirical studies in various domains have
demonstrated the superiority of cost-sensitive learning techniques over sampling
methods (e.g., Liu & Zhou, 2006; Zhou & Liu, 2005).
In the field of financial fraud detection, researchers have
examined algorithms with cost-sensitive concepts, such as weighted SVM, cost-
sensitive Naïve Bayes, and KNN, to address high class imbalance (Moepya et al.,
2014; Twala & Nelwamondo, 2017). Additionally, the financial misstatement
prediction model proposed by Kim et al. (2016) employed MetaCost, a data
preprocessing technique to reflect asymmetric misclassification costs, to minimize
cost during training. However, it is worth noting that the use of cost-sensitive
classification for imbalanced learning remains limited in the financial statement fraud
literature.
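To make the idea concrete, the following minimal sketch trains a weighted SVM in which misclassifying the minority (fraud-like) class costs ten times more than a false alarm; the 1:10 cost ratio and the synthetic imbalanced data are illustrative assumptions, not settings drawn from the studies cited above.

```python
# Illustrative cost-sensitive (weighted) SVM; the 1:10 cost ratio and the
# synthetic imbalanced dataset are assumptions for demonstration only.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Simulate roughly 1% positive (fraud-like) class imbalance.
X, y = make_classification(n_samples=10000, n_features=20, weights=[0.99],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for name, weights in [("unweighted", None), ("cost-sensitive", {0: 1, 1: 10})]:
    clf = LinearSVC(class_weight=weights, dual=False).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, clf.decision_function(X_te))
    print(f"{name}: AUC = {auc:.3f}")
```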
2.1.2.3 Dealing with missing data
Missing data is a common challenge in fraud research.
Research from the data mining literature offers various solutions to address this issue
(Feelders, 1999; Twala, 2009). Surprisingly, only a few existing fraud studies have
thoroughly discussed this issue, and the impact of different missing value treatments
on experimental results remains largely unknown. In early fraud studies like
Cecchini et al. (2010a), Dechow et al. (2011), and Perols (2011), complete case
analysis is employed to handle missing data. This method involves excluding entire
firm-observations that contain missing values. Other approaches in fraud research
adopt statistical techniques such as mean and zero imputation, or machine learning
techniques like support vector machines to handle missing values. For instance, Perols
et al. (2017) replace missing values with global means during data preprocessing for
fraud prediction. Walker (2020) imputes missing values as zero, and partially the same
approach is employed in Bao et al. (2020) before inputting the data into their
proposed fraud detection models. Craja et al. (2020) and Hajek & Henriques (2017)
employ various machine learning algorithms to estimate missing values based on the
available data.
Complete case analysis, which excludes the entire records
with missing values, is a crude approach to handling missing data. It can introduce
bias and discard useful information. Alternatively, employing statistical techniques
that substitute missing values with zero, mean, or median can enhance performance.
However, the improvement depends on input data characteristics like variance and
skewness. More advanced machine learning techniques impute missing values by
estimating them from available data (Batista & Monard, 2003; Howell, 2007). In
short, the benefits of different missing data treatments depend on the problem domain,
data patterns, and dataset size (Allison, 2001).
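As a concrete illustration of the treatments discussed above, the sketch below applies complete case analysis, simple statistical imputation, and model-based imputation to a toy table; the column names and values are hypothetical, not the variables used in this study.

```python
# Illustrative comparison of missing data treatments on hypothetical data.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

df = pd.DataFrame({"receivables": [1.2, np.nan, 0.8, 2.1],
                   "inventory":   [0.5, 0.7, np.nan, 1.0],
                   "sales":       [3.0, 2.5, 2.8, np.nan]})

# 1) Complete case analysis: drop any firm-observation with a missing value.
complete_cases = df.dropna()

# 2) Statistical imputation: substitute zero (or the column mean) for NaN.
zero_imputed = df.fillna(0.0)
mean_imputed = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(df),
                            columns=df.columns)

# 3) Model-based imputation: estimate each missing value from the other columns.
model_imputed = pd.DataFrame(IterativeImputer(random_state=0).fit_transform(df),
                             columns=df.columns)
```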
2.1.2.4 Summary and discussion of prior research
In summary, the selection and structure of fraud predictors
should be determined by the prediction algorithm employed. Though undersampling
is common for handling the class imbalance issue in fraud studies, it often lacks out-of-
sample testing and thus raises generalizability concerns. Research using large datasets
to enable realistic out-of-sample testing is lacking. Research on the impact of various
missing value treatments on fraud prediction models is also scarce in the fraud
literature, especially for cost-sensitive methods.
2.2.1 Background
Internal control has long been recognized as a fundamental process
assuring the reliability of accounting and financial information. The Public Company
Accounting Oversight Board (PCAOB) defines a material weakness in internal
control (ICMW) as, “a deficiency, or a combination of deficiencies, in internal control
over financial reporting, such that there is a reasonable possibility that a material
misstatement of the company’s annual or interim financial statements will not be
prevented or detected on a timely basis.” Weak internal control is linked to a higher
risk of fraudulent activities (e.g., employees misappropriating assets) as it provides
opportunities for managers or other employees to override the control for their self-
interests (Donelson et al., 2017). Disclosure of internal control weakness may damage
investors’ confidence in corporate governance (Johnstone et al., 2011). It may raise
customers’ concerns about the product/service’s quality, eventually leading to a
decline in sales growth (Su et al., 2014). In addition, firms are also more likely to file
for bankruptcy if material weaknesses in their internal controls are found or disclosed
(Doyle et al., 2007). Therefore, responding to the demand for high internal control
quality, the Sarbanes-Oxley Act of 2002 (SOX) mandates and specifies
management’s (Section 404 (a)) and auditors’ (Section 404 (b)) responsibility in
reporting their evaluation on the effectiveness of internal controls over financial
reporting (ICFR).
To form an opinion of the effectiveness of ICFR, auditors are also
required to plan the audit and collect an appropriate amount of audit evidence
according to the preliminary evaluation of their clients’ internal control (PCAOB
2007). The extent of substantive testing of financial statement accounts and
disclosures often relies on the effectiveness of ICFR. Therefore, the evaluation of
ICFR is essential for an audit. However, regulators (e.g., the PCAOB) have expressed
great concern about ICFR audit deficiencies in practice (Franzel, 2015): an upward
trend is observed in the percentage of firms with ineffective ICFR and material
weaknesses (Calderon et al., 2016), and ICFR audit deficiencies are among the most
frequently cited findings in the PCAOB's annual inspections (Franzel, 2015). From a practical
perspective, therefore, there is a high demand for a more effective evaluation of ICFR
and identification of ICMW for investors, auditors, regulators, and other stakeholders.
The importance of control effectiveness can also be inferred from
the extensive literature regarding internal controls. Various ICMW prediction models
using firm characteristics, financial indicators, and external factors have been
proposed in existing literature (e.g., Ashbaugh-Skaife et al., 2007). Taking advantage
of the latest development in data mining techniques, researchers have begun to turn to
new types of data sources (e.g., conference calls) for incremental information to
enhance ICMW identification (e.g., Nasir et al., 2021; Rich et al., 2018). However,
these research studies are all based on corporate disclosure (e.g., financial reporting,
conference calls). The main issue of using corporate disclosure, whether voluntary or
mandatory, to infer or predict internal control implementations is that this information
is provided by the firms, making disclosure information subject to the possibility of
being managed or even manipulated. Data sources that the firm cannot manage may
tell a different story. Therefore, these alternative data sources, outside the direct
management of the firm, are valuable to mine. Study 2 focuses on online employee
reviews as a new type of data source to infer accounting control practices in a firm,
which has not been examined in the literature yet.
conference calls and MD&A. These studies assume that management is aware of the
ICMW and may communicate it to stakeholders unconsciously or deliberately.
While this assumption is plausible, under certain circumstances (such as fraud),
management may be motivated to conceal their private information. In such cases,
management, aware of machine processors or other advanced data analytics, may
change their disclosure strategy to manipulate their audience. For example, Cao et al.
(2020) show that when managers know their disclosure readers are machines, they
will manage the textual sentiment and audio emotion in their disclosures. Thus, an
analysis of management disclosure may not be as worthwhile as before, even with the
latest data mining techniques. Alternative data sources – such as public employee
reviews – that are out of the control of management become more critical and may
contain more valuable information compared to the traditional ones.
Job review websites (e.g., Glassdoor, Indeed, Vault) usually
require reviewers to leave anonymous comments about their firms. Since employers
are not allowed to delete or change employee reviews, this study assumes that
employee opinions from the job review websites reflect employees’ genuine thoughts
about their employers. These reviews can be a rich private information source about
employees’ working conditions and opinions (Teoh, 2018). As a new type of data,
they are used to understand and re-construct corporate culture (Corritore et al., 2020),
which is then linked to research relating to organizational misconduct (Aswani &
Fiordelisi, 2020; Cavicchini et al., 2021), firm performance (Guiso et al., 2015), and
financial reporting risk (Ji et al., 2017). Other studies treat employee reviews as a
source to provide incremental information for stock returns prediction (Green et al.,
2019), future corporate disclosure (Hales et al., 2018), corporate misconduct
prediction (Campbell & Shang, 2021), and financial misconduct risks evaluation
(Zhou & Makridis, 2019). The majority of existing research uses the readily available
firm ratings to proxy employee opinions (e.g., Campbell & Shang, 2021), with a few
exceptions performing a natural language processing (NLP) analysis of the firms’
employee reviews (e.g., Cavicchini et al., 2021).
However, online employee reviews have not yet been used to infer
the implementation of internal controls. While top management is responsible for
planning, implementing, and supervising internal controls, employees at all levels are
the main players in the control process. The working environment described by online
employee reviews is also the control environment of a firm. Thus, these reviews may
contain information about the firm's compliance with control policies and procedures,
as well as control deficiencies in design or operations.
In summary, although existing studies have shown the benefits of
utilizing Big Data and data analytics for the early detection of internal control
material weakness, these prior works suffer from using only data disclosed by the
firms (e.g., annual reports, conference calls). As management may employ a counter-
strategy to these advanced data analytics techniques and disclose only self-serving
information, analyzing data from sources that are out of the control of management
becomes potentially more valuable. Online employee reviews thus become worthy of
investigation to infer the control practices in a firm.
CHAPTER 3
STUDY 1: FINANCIAL FRAUD DETECTION
1 The default setting of CF has two estimators, and each estimator consists of one random forest (RandomForest)
and one completely random forest (ExtraTrees).
2 An alternative approach following the idea of cost-sensitive learning is to change the class weight in the decision
trees of the forests when calculating the “impurity” score of a selected split point. Bootstrap Class Weighting was
employed as it gave better results. Class Weighting and Bootstrap Class Weighting can be achieved by setting the
class_weight argument to the corresponding value in each base classifier (forest) in Python.
3 https://www.spglobal.com/marketintelligence/en/?product=compustat-research-insight
4 This study defines financial firms as the Financial Services sector in Compustat, which includes banks, insurance
companies, broker/dealers, real estate, and other financial services.
predictive value, this study imputes missing values to balance between using available
data and introducing noise. Experiments were done in Python, and the sample selection
process is shown in Table 3.2.
5 A more aggressive cutoff line for deleting observations with missing values is 50% in the work of Craja et al.
(2020).
3.2.4 Labeling
This study relies on the disclosure of Accounting and Auditing
Enforcement Releases (AAERs) by the U.S. Securities and Exchange Commission
(SEC) to identify fraudulent financial reports for the following reasons: First, AAERs
have been widely used in research that requires highly reliable
fraud identification, such as for identifying fraudulent firms (e.g., Craja et al., 2020;
Fanning & Cogger, 1998; Feroz et al., 2000; Humpherys et al., 2011; Purda &
Skillicorn, 2015) and for misconduct research (Karpoff et al., 2017). The SEC applies
enforcement actions to significant fraudulent activities supported by substantial and
robust evidence. Thus, the AAER database is considered a trustworthy and consistent
source for accurately identifying fraudulent firms. Second, AAERs provide
information about enforcement actions taken against auditors, companies, or
individuals for violations of SEC and federal rules. These releases contain details
about the entities involved, the nature of misconduct, and the impact of such
misconduct on financial statements. Thus, choosing AAERs as the fraud proxy opens
more opportunities for future research.
In this study, the AAER database provided by the USC Marshall
School of Business is used to label fraudulent firms.6 All firm-observations not
labeled as fraudulent in the database are considered non-fraudulent firms. Since the
number of labeled fraud firms in the AAER database is very limited before 2015 (less
than 5), the year 2014 is chosen as the cutoff for testing the proposed models. The
distribution of fraud firms can be found in Table 3.3, providing an overview of the
fraudulent cases within the dataset.

6 USC Marshall School of Business provides the AAER database compiled by Dechow et al. (2011).
Table 3.3: Distribution of fraud firms over 1991-2014
Section 3.3.1 describes the sample period and parameter settings for
the main experiments. Section 3.3.2 discusses the performance evaluation metrics,
and Section 3.3.3 presents experimental results.
years for fraud to be disclosed early. From Table 3.3, the number of fraud cases
reported in AAER has shown a noticeable decline since 2001. However, Donelson et
al. (2021) present different findings (Table 1, Panel C) regarding the number of
financial reporting fraud cases identified based on private enforcement during the
same period. This suggests a high possibility of false negatives in labeling fraud,
indicating that the percentage of undetected fraud may have increased since 2001.
Considering the decline in regulators' enforcement of accounting fraud since the 2008
financial crisis, this study ends the test period in 2008 to align with Bao et al. (2020)
for comparison purposes. It is important to note that due to undetected fraud, the
optimal test period for performance evaluation is considered to be from 2003 to 2005.
Therefore, this study presents average results for both the 2003-2005 and 2003-2008
periods to assess the robustness of the prediction models.
3.3.1.2 Parameter settings
The experiments in this study utilized two specific classifiers:7
the "RUSBoostClassifier" from the imbalanced-learn Python toolbox and the
"CascadeForestClassifier" from the Deep Forest Python package.8 Grid search was
employed to find the optimal settings for the two hyperparameters required in the
CascadeForestClassifier: the number of forests (the number of estimators in each
cascade layer) and the number of trees in each forest (the number of trees in each
estimator). The resulting number of estimators is set to 2, and the number of trees in
each estimator is 100. To achieve bootstrap class weighting for cost-sensitive learning,
the class_weight argument was set to 'balanced_subsample' in each estimator. Five-fold
cross-validation was applied to each of the four base classifiers to alleviate the
overfitting issue. The experimental workflow is summarized in Figure 3.1.
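A rough sketch of this configuration in Python follows; the synthetic data stand in for the preprocessed firm-year observations, and injecting class-weighted forests via set_estimator is one plausible way to realize the bootstrap class weighting of footnote 2, not the study's exact code.

```python
# Sketch of the cost-sensitive cascade forest setup (assumptions noted above).
from deepforest import CascadeForestClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Synthetic imbalanced stand-in for the preprocessed firm-year observations.
X, y = make_classification(n_samples=2000, n_features=28, weights=[0.99],
                           random_state=0)

# Grid-searched hyperparameters from the text: 2 estimators per cascade layer,
# 100 trees per estimator.
model = CascadeForestClassifier(n_estimators=2, n_trees=100, random_state=0)

# Bootstrap class weighting: each tree re-balances class weights within its
# bootstrap sample, making the impurity-based splits cost-sensitive.
model.set_estimator([
    RandomForestClassifier(n_estimators=100,
                           class_weight="balanced_subsample", n_jobs=-1),
    ExtraTreesClassifier(n_estimators=100, bootstrap=True,
                         class_weight="balanced_subsample", n_jobs=-1),
])

model.fit(X, y)
fraud_scores = model.predict_proba(X)[:, 1]  # scores used to rank firm-years
```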
7 Note that the data preprocessing does not include variable scaling (e.g., normalization). Both RUSBoost and the
proposed model are decision-tree-based ensemble learning algorithms. As the learning of a decision tree is based
on rules, both algorithms are not sensitive to variable scaling. Firm observations were fed to the models after
preprocessing as in Section 4.
8 Documentation of the Python package “Deep Forest”: https://deep-forest.readthedocs.io/en/latest/index.html
9 Bao, Ke et al. (2020) argue that the setting of threshold k can be important for NDCG. This study performed a
sensitivity analysis of NDCG’s threshold setting at 0.5%, 1%, 1.5%, and 5%. Experimental results show that a
higher threshold does not always lead to a higher NDCG score, suggesting the NDCG score is insensitive to the
setting of k in the experiments.
Fig. 3.2 Effect of training size on models’ AUC with missing value treatments
corresponding account category identified by Dechow et al. (2011) and mapped to its
frequency ranking as reported in Bao et al. (2020).
Interestingly, there is a noteworthy overlap between the top-10
financial variables listed in Table 3.5 and the account categories most impacted by
misstatements. Specifically, the four most influential predictors of fraud, namely "sale
of common and preferred stock," "common shares outstanding," "price close, annual,
fiscal," and "common/ordinary equity, total," are all associated with the account
category "inc_exp_se." It is important to note that "inc_exp_se" encompasses a
mixture of affected accounts that defy classification as income, expense, or equity
accounts. This compelling evidence indicates that the proposed model demonstrates
the capability to identify crucial financial variables that are frequently influenced by
misstatements.
Table 3.5 Feature importance of the top-10 financial variables and the related accounts
affected by the detected frauds

Rank | Raw financial variable | Average score of feature importance for the test period 2003-2008 | Related indicator variables in Dechow et al. (2011) | Rank (relative frequency) of the indicators affected by the detected frauds over the test period 2003-2008
1 | Sale of Common and Preferred Stock | 0.0566 | inc_exp_se | 1 (31.8%)
2 | Common Shares Outstanding | 0.0366 | inc_exp_se | 1 (31.8%)
3 | Price Close, Annual, Fiscal | 0.0359 | inc_exp_se | 1 (31.8%)
4 | Common/Ordinary Equity, Total | 0.0334 | inc_exp_se; res | 1 (31.8%); 5 (5.5%)
5 | Cash and Short-Term Investments | 0.0326 | asset | 3 (11.8%)
6 | Current Assets, Total | 0.0314 | asset | 3 (11.8%)
7 | Sales/Turnover (Net) | 0.0310 | rev | 2 (23.9%)
8 | Property, Plant, and Equipment, Total | 0.0308 | asset | 3 (11.8%)
9 | Receivables, Total | 0.0306 | rec | 4 (8.6%)
10 | Inventories, Total | 0.0306 | inv | 5 (5.5%)

The definitions of the indicator variables are shown in Appendix A.
applying RUSBoost to a complete set of raw financial variables (including 266 raw
data items) drew a negative response to this question. As discussed earlier,
RUSBoost in Matlab addresses missing data using a decision tree surrogate-split mechanism.
Consequently, the algorithm relies heavily on the available data for prediction,
introducing a bias towards the available data. Thus, such bias may be exaggerated
when the scale of missing values is large. Numerous studies in the field of data
mining have demonstrated that statistical or supervised learning imputation methods
outperform surrogate splits in tree-based learning. This effect is particularly
pronounced when the training data exhibits a high incidence of missing values (Saar-
Tsechansky & Provost, 2007; Twala, 2009).
To further investigate the question, this study follows the same
procedures as Bao et al. (2020), using both RUSBoost and CSCF
with different missing value treatments. Data are obtained from Compustat's
"Fundamentals Annual" database10. A list of 187 financial items is used as feature
inputs after excluding variables with more than 25% missing values. For simplicity, a
straightforward approach of replacing missing values with zeros was utilized. The
experimental results are presented in Table 3.6.
Table 3.6 provides empirical evidence that supports the concern
regarding the misleading outcomes caused by the inherent splitting mechanism in
Matlab's RUSBoost algorithm. In Panel A, the results are significantly poorer when
using the same algorithm with zero imputation for missing data, casting doubt on the
conclusion that incorporating additional financial variables does not enhance fraud
prediction. However, when zero imputation is applied in Panels B and C, both models
demonstrate improved performance, with the proposed CSCF consistently
outperforming RUSBoost, aligning with the earlier findings.
10 Fundamentals Annual provides a comprehensive set of annual fundamental data, including 710 financial
statement items and 131 miscellaneous and supplemental items.
11 The computational environment in this study is: Intel(R) Core(TM) i7-10700KF CPU @ 3.80GHz; RAM 32GB;
NVIDIA GeForce RTX 3070.
Fig. 3.3 AUC vs Time per Layer (in seconds) for various feature set sizes
3.5 Conclusion
3.5.3 Limitations
The primary limitation of this study lies in the use of imperfect
fraud proxy measures, which pose challenges in accurately representing the
underlying construct of fraud. The performance of fraud detection models can vary
depending on the choice of fraud proxy, highlighting the need for further research in
developing more robust and comprehensive fraud proxies.
Another limitation of this study is the exclusive use of structured
data from public financial statements for the fraud detection model. It does not
consider the potential insights that may be derived from other forms of structured or
unstructured data, such as employee messages. Incorporating additional data sources
could provide valuable information for enhancing the prediction of fraud. Exploring
the integration of diverse data types in future research could lead to more
comprehensive and accurate fraud detection models.
CHAPTER 4
STUDY 2: ACCOUNTING CONTROL ISSUES IDENTIFICATION
corporate culture (e.g., Corritore et al., 2020) and workplace environment (e.g., Li,
2022).
As every employee has responsibilities to fulfill in an internal control
system, this study proposes that analyzing employee reviews may provide a glimpse
of a firm's internal control practices, through which practitioners and researchers may
benefit from obtaining an early signal of control problems for their decision-making.
Thus, the objective of Study 2 is to investigate the possibility of utilizing
online employee reviews as an alternative information source to indicate the internal
control practices in the companies. Upon exploring and confirming the potential of
using online employee reviews for the purpose of indicating internal controls (Section
4.1), an exclusive online employee reviews dataset is developed (Section 4.2), which
is then used to train the machine learning classifiers for identifying reviews with
mentions of control issues (Section 4.3). The current chapter is organized as follows:
First, Section 4.1 explores the online employee reviews collected. It
verifies the hypothesis that online employee reviews contain information that can be
used to infer accounting control issues, which serves as a prerequisite for subsequent
analysis. Section 4.2 describes the development of the online employee reviews
dataset, including how the data were collected, coded, cleaned, and verified. In
Section 4.3, the dataset created in Section 4.2 is used to train classifiers that
identify employee reviews with mentions of control issues. An in-depth comparative
analysis of the performance of a variety of machine learning models on this task is
conducted, based on which a web application is developed. The app's architecture,
functionalities, and example webpage layouts are also presented in the same section.
The last section (Section 4.4)
discusses the impact of the application and future work that can be undertaken
following the current study.
issues (e.g., reported material weakness) are expected to exhibit a higher prevalence
of control-related problems in their employee reviews compared to those with minor
control issues (e.g., without reported material weakness). This comparison of the
prevalence of control-related problems in online employee reviews between firms
with and without reported material weakness serves as a prerequisite for the
subsequent analysis in Study 2.
The Internal Control Integrated Framework (2013) issued by the
Committee of Sponsoring Organizations of the Treadway Commission (“The COSO
framework”) is the most recognized guidance for internal control implementation.
The COSO framework was therefore used as the basis for coding the dataset.
The data exploration began by consulting the 17 accounting control
principles of the COSO framework and comparing the proportion of relevant
employee reviews from two groups of sampled companies. Only principles of the
COSO framework whose implementation may be inferred directly from employees’
self-reports of working experiences were chosen. For example, an employee reporting
an observation of ethical violations (e.g., bribery) would relate to Principle 1
“integrity and ethical values.” A total of nine principles were identified. The topics
were extracted from the reviews and linked to the identified principles, as shown in
Table 4.1 (next page).
Eighteen listed U.S. companies in different industries that have reported
internal control material weakness in the Audit Analytics Internal Control (AAIC)
database 12 were randomly selected. In order to also have data on firms without
reported internal control material weaknesses, the 18 selected companies were
matched with 18 other firms that had the closest asset size in the same industry but did
not have material weakness reports in AAIC. Appendix B shows the sampled
companies. Employee reviews from before the year in which material weaknesses
were reported were gathered for the selected companies and their respective matches.
To ensure the protocols were written in alignment with COSO, the author drafted
protocols for each principle, which were later reviewed and revised by a full professor
of auditing.
12 https://www.auditanalytics.com/products/sec/internal-controls
Table 4.1: Prevalence of accounting control issues in random sample of 200 employee reviews from balanced set of 36 companies (18
companies with, and 18 companies without internal control material weakness)
4.2 A dataset of online employee reviews for identifying accounting control issues
13 https://pamtag.pamplin.vt.edu
reliability amongst undergraduate student coders on Principles 3, 4, and 11,14 only the
undergraduate group coding for Principle 13-15 “Information and communication”
(34 coders) was kept, and the senior authority coders labeled the full sample of
reviews to detect mentions of Principle 3.15
For Principle 3: “Organizational structure, authority, and
responsibility”, each of the two authority coders (the author and a senior
collaborator16) labeled 2,000 reviews. Cohen’s kappa (κ) (Cohen, 1960) was used to
evaluate inter-rater reliability. A κ = 0.650 between the two authority coders was
observed, suggesting substantial inter-rater agreement, according to Landis and Koch
(1977). A full professor in auditing served as a third authority coder and labeled the
disagreements. The final labels represent the consensus of all three authority coders
after discussion. The coding process identified 194 reviews with mentions of Principle
3 “Organizational structure, authority, and responsibility”.
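For reference, the agreement statistics reported in this section can be reproduced in a few lines of Python; the two label vectors below are hypothetical stand-ins for the coders' annotations.

```python
# Cohen's kappa for two coders' binary labels (1 = mentions the control issue).
# The label vectors are hypothetical illustrations, not the actual annotations.
from sklearn.metrics import cohen_kappa_score

coder_a = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
coder_b = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]

kappa = cohen_kappa_score(coder_a, coder_b)
# Landis & Koch (1977): 0.41-0.60 moderate, 0.61-0.80 substantial agreement.
print(f"kappa = {kappa:.3f}")
```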
For Principles 13-15: “Information and communication”, the labels
created by the student coders were carefully reviewed. Student coders whose total
number of positive mentions (of Principle 13-15) was two or more standard deviations
away from the mean were considered rogue coders. A total of six rogue coders were
removed. Another
two coders were disqualified because they failed to follow the protocol, which
required capture (copy-and-paste) of the specific paragraph(s) describing the
identified control issue mentioned in the review. After eliminating rogue and
disqualified coders, twenty-six (26) coders remained, with 8,195 reviews coded.
Since the labeling (coding) was randomly assigned in Pamtag, some
reviews were coded more than once. Following the examples of prior research (e.g.,
Abrahams, Fan, Wang, Zhang, & Jiao, 2015; Abrahams, Jiao, Wang, & Fan, 2012;
Goldberg & Abrahams, 2018), the labels were determined by majority vote for
reviews coded more than twice. A conservative strategy was applied for the reviews
coded twice: if one of the two coders of a single observation had observed an
accounting control issue, that coder’s label overrode that of the coder who did not
believe an accounting control issue was mentioned.

14 Even though one group of student coders was assigned to code for mentions of Principles 11 and 13-15
(Information and communication), these two types of control issues were treated separately when calculating the
inter-rater reliability. Principle 11 was dropped because of unsatisfactory inter-rater reliability amongst
undergraduate student coders, for the same reasons as for Principles 3 and 4.

15 Each of the two authority coders coded 1,000 reviews for Principle 4. However, only 66 unique positive records
were identified. Due to budget constraints, Principle 4 was dropped from the dataset.

16 The collaborator is an assistant professor in the field of auditing who received her Master's degree and PhD from
universities in the UK and Hong Kong, respectively.
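The aggregation rule just described can be stated compactly in code; this is a minimal sketch of the logic, with a hypothetical label dictionary rather than the actual Pamtag export.

```python
# Aggregating coders' binary labels per review: majority vote when a review
# was coded three or more times; when coded exactly twice, a single positive
# label wins (the conservative strategy described above).

def aggregate_labels(labels):
    """labels: list of 0/1 codes for one review, one entry per coder."""
    if len(labels) == 2:
        return max(labels)  # conservative: one positive label suffices
    return int(sum(labels) > len(labels) / 2)  # strict majority vote

# Hypothetical stand-in for the Pamtag export.
labels_by_review = {"rev_001": [1, 0], "rev_002": [0, 1, 0], "rev_003": [1, 1, 0]}
final = {rid: aggregate_labels(lbls) for rid, lbls in labels_by_review.items()}
print(final)  # {'rev_001': 1, 'rev_002': 0, 'rev_003': 1}
```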
Each of the two authority coders coded 1,300 reviews. The inter-
rater reliability between the two authority coders is κ = 0.633, indicating “substantial”
agreement according to Landis and Koch (1977). A third authority coder coded the reviews with
disagreements. All three authority coders discussed the disagreements, and a
consensus was reached to form the final labels for the records coded by authority
coders. Next, the records of positive mentions were merged with the ones coded by
student coders. The inter-rater reliability between authority and student coders is
κ = 0.601, regarded as moderate agreement by Landis and Koch (1977). The resulting
dataset contains 669 identified reviews with mentions of Principle 13-15 “Information
and communication”.
The dataset can be accessed via a Mendeley Dataset Repository at:
https://data.mendeley.com/datasets/zw2shn7pv3/1
Section 4.1 showed that online employee reviews could provide vital
insights into internal control practice in a firm, at least for certain, isolated principles
(specifically, Principles 3 and 13-15). However, determining whether a single
employee review suggests potential accounting control issues requires considerable
professional judgment, and the volume of content can make identifying relevant
reviews difficult and tedious.
This study developed a web-based automated control issue detection tool
for users to identify (from a restricted list: Principles 3 and 13-15, again) online
employee reviews with mentions of relevant control issues. Users can upload their
collected reviews, select the type of control issues to identify, specify the predictive
models of their choice, and download the results. The web application will automate
the classification of the uploaded file and directly rank the reviews in descending
target class of interest and infrequent otherwise. Compared to the alternative machine
learning approaches, such as sentiment analysis and deep learning word embedding
models, the most substantial advantage of smoke terms is their interpretability. Smoke
terms can assist the decision-makers in understanding the model’s process with
emphasized prevalent words or phrases. Originating from research efforts to mine
product defect-related information from online consumer reviews (e.g., Abrahams et
al., 2012), the smoke-term technique has been applied to mine safety concerns-related
information from online consumer reviews (e.g., Abrahams et al., 2013; Abrahams et
al., 2012; Winkler et al., 2016), service quality detection in hospitals (Zaman et al.,
2020), and mortgage origination delay detection in financial services (Brahma et al.,
2021; Goldberg et al., 2022).
Fumeus, the smoke term analysis tool developed by Goldberg,
Gruss et al. (2022), is utilized to develop the web-based application. There are two
core functions in Fumeus: smoke term generation and smoke term scoring. The
“smoke term generation” function generates a weighted dictionary of smoke terms
based on the given dataset, which can then be used to compute the smoke term scores
for unseen records in a textual dataset. The “smoke term scoring” function returns a
file with records sorted by the smoke term scores in descending order, as well as
corresponding smoke term scores and a list of all smoke terms for each record. There
are two hyper-parameters to be set in Fumeus: the length of the smoke terms
(unigrams, bigrams, or trigrams) and the information retrieval metric (“Correlation
Coefficient”, “Robertson’s Selection Value” (Robertson, 1986), “Relevance
Correlation Value” (Fan et al., 2005), or “Document and Relevance
Correlation” (Fan et al., 2005)) used to derive smoke terms.
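To illustrate the two-step idea of generation followed by scoring without reproducing Fumeus's actual interface, the generic sketch below weights unigrams by a simple correlation between term presence and the positive label, then scores and ranks unseen reviews; the corpus, labels, and scoring formula are illustrative assumptions only.

```python
# Generic smoke-term illustration (NOT Fumeus's actual API): weight unigrams
# by the correlation between term presence and the positive label, then score
# unseen reviews with the resulting weighted dictionary. Toy data throughout.
import numpy as np

labeled = [("management never communicates policy changes", 1),
           ("no communication between departments", 1),
           ("great pay and friendly coworkers", 0),
           ("flexible hours and good benefits", 0)]

vocab = sorted({w for text, _ in labeled for w in text.split()})
presence = np.array([[int(w in text.split()) for w in vocab]
                     for text, _ in labeled])
y = np.array([label for _, label in labeled])

# Correlation-style weight per term; keep only positively weighted "smoke" terms.
weights = {}
for j, term in enumerate(vocab):
    r = np.corrcoef(presence[:, j], y)[0, 1]
    if r > 0:
        weights[term] = r

def smoke_score(text):
    return sum(weights.get(w, 0.0) for w in text.split())

unseen = ["communication from management is poor", "benefits are great"]
for review in sorted(unseen, key=smoke_score, reverse=True):
    print(f"{smoke_score(review):.2f}  {review}")
```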
the firms.17 As illustrated above, the smoke terms clearly explain and communicate the
retrieval rationale of the method.
17 The protocol for labeling the reviews with mentions of Principle 13-15 is “Sufficient and relevant information is
identified, captured and provided to the right people in a timely manner, enabling them to take action to meet
objectives and mitigate risks. Open lines of communication, upstream, downstream, cross functionally, and
externally, promote understanding and acceptance of values and assist the division/department in meeting their
objectives”, as presented in Appendix E.
Accuracy (76%). The model with Word2Vec as the word embedding produces the
highest Recall value (72%), indicating that it correctly identifies 72% of the true
positive examples in the dataset.
Table 4.3 Experimental results

Panel A: Principle 13-15
Classifier | AUC | Recall | Precision | Accuracy
Word2Vec | 0.87 | 86% | 77% | 80%
GloVe | 0.90 | 86% | 84% | 85%
BERT | 0.73 | 42% | 79% | 65%
GPT2 | 0.95 | 91% | 93% | 92%
Logistic regression | 0.80 | 84% | 77% | 80%
SVM | 0.89 | 87% | 79% | 82%
Decision tree | 0.79 | 83% | 78% | 79%
Random forest | 0.90 | 87% | 81% | 83%

Panel B: Principle 3
Classifier | AUC | Recall | Precision | Accuracy
Word2Vec | 0.74 | 72% | 63% | 65%
GloVe | 0.71 | 67% | 67% | 67%
BERT | 0.82 | 67% | 81% | 76%
GPT2 | 0.69 | 23% | 60% | 53%
Logistic regression | 0.58 | 64% | 57% | 58%
SVM | 0.52 | 64% | 56% | 56%
Decision tree | 0.49 | 41% | 48% | 49%
Random forest | 0.58 | 38% | 60% | 56%

*Bold in the original indicates the best-performing model (row heading) on the current metric (column heading).
Given that different machine learning models utilize distinct
approaches for data analytics and pattern recognition, they may be capable of
identifying unique examples (observations) that are not detected by other models.
Thus, decision-makers may benefit from combining the predictive outputs of multiple
models to identify additional target observations that are exclusively recognized by
specific models. The ShineThrough
Metric and Occlusion Metric, as proposed by Mali et al. (2022) and Restrepo et al.
(2022), are performance metrics that assess a model's ability to identify exclusive
observations. These metrics can be applied to evaluate both individual models and
combinations of two models. By comparing the metrics for individual models and the
combined model, decision-makers can observe the extent to which the combined
model detects extra true positives through the ShineThrough Metric and reduces false
negatives through the Occlusion Metric.
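Based on the verbal definitions above, the following set-based sketch computes ShineThrough and Occlusion for a model or pair of models from the true positives each classifier identifies; this formulation and its toy data are an illustrative reading of the metrics, not the reference implementation of Mali et al. (2022) or Restrepo et al. (2022).

```python
# Illustrative set-based reading of the MARS metrics (toy data; not the
# authors' reference implementation). `tp` maps each classifier to the set
# of true positive record IDs it correctly identifies.
tp = {"GPT2": {1, 2, 3, 4, 5, 6}, "GloVe": {2, 3, 7}, "DT": {3, 5, 8}}
all_tp = set().union(*tp.values())  # true positives found by any classifier

def shinethrough(models):
    """Share of all true positives found ONLY by the selected model(s)."""
    ours = set().union(*(tp[m] for m in models))
    others = set().union(*(tp[m] for m in tp if m not in models))
    return len(ours - others) / len(all_tp)

def occlusion(models):
    """Share of all true positives the selected model(s) miss but others catch."""
    ours = set().union(*(tp[m] for m in models))
    return len(all_tp - ours) / len(all_tp)

print(shinethrough(["GPT2"]))     # exclusive finds of GPT2 alone
print(occlusion(["GPT2"]))        # GPT2's misses that other models catch
print(occlusion(["GPT2", "DT"]))  # combining two models reduces occlusion
```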
Tables 4.4 & 4.5 provide an overview of the MARS metrics
for Principle 13-15 and Principle 3, respectively. Table 4.3 Panel A reveals that the
difference in the ShineThrough scores between the individual and combined models is
mild, with the highest ShineThrough score of 0.02 in multiple combinations of two
models. Given that GPT2 offers the best recall score (suggesting GPT2 correctly
identified the highest number of true positive examples), the optimal combination of
two models to identify more true positive observations is that of GPT2 and GloVe.
However, the advantages of model combination are limited, as the combined
ShineThrough score is 0.02, signifying that a mere 2% of the total true positive
observations identified by all classifiers were exclusive to these two classifiers
(namely GPT2 and GloVe) and not detected by any other classifiers.
Meanwhile, as revealed in Table 4.4 Panel B, GPT2 offers
the lowest Occlusion score (0.08) among all individual models, suggesting that 8% of
the total true positive observations across all classifiers were incorrectly missed (false
negatives) by GPT2 but correctly recognized by at least one other classifier. The
combination of GPT2 and decision tree further decreased the Occlusion score from
0.08 to 0.03. Thus, the decision tree model is incorporated into the web application as
a supplementary classifier for users who wish to explore additional reviews after
reviewing all the records identified by the GPT2 model.
[Tables 4.4 and 4.5 survive only as the row fragments below. Judging from the diagonal dashes, the column order is BERT, Decision tree, GloVe, GPT2, Logistic regression, Random Forest, SVM, Word2Vec, and each cell reports the row model's individual score | the row-column combination's score.]

Table 4.4 fragment (Principle 13-15):
Logistic regression 0.0 | 0.08 0.0 | 0.0 0.0 | 0.05 0.0 | 0.03 - 0.0 | 0.0 0.0 | 0.0 0.0 | 0.0
Random Forest 0.0 | 0.08 0.0 | 0.0 0.0 | 0.05 0.0 | 0.03 0.0 | 0.0 - 0.0 | 0.0 0.0 | 0.0
SVM 0.0 | 0.08 0.0 | 0.03 0.0 | 0.05 0.0 | 0.03 0.0 | 0.0 0.0 | 0.0 - 0.0 | 0.0
Word2Vec 0.0 | 0.08 0.0 | 0.0 0.0 | 0.05 0.0 | 0.11 0.0 | 0.0 0.0 | 0.0 0.0 | 0.0 -

Table 4.5 fragment (Principle 3):
Logistic regression 0.34 | 0.18 0.34 | 0.29 0.34 | 0.24 0.34 | 0.16 - 0.34 | 0.34 0.34 | 0.32 0.34 | 0.18
Random Forest 0.55 | 0.24 0.55 | 0.45 0.55 | 0.42 0.55 | 0.18 0.55 | 0.34 - 0.55 | 0.34 0.55 | 0.26
SVM 0.34 | 0.16 0.34 | 0.29 0.34 | 0.21 0.34 | 0.13 0.34 | 0.32 0.34 | 0.34 - 0.34 | 0.18
Word2Vec 0.26 | 0.11 0.26 | 0.21 0.26 | 0.18 0.26 | 0.16 0.26 | 0.18 0.26 | 0.26 0.26 | 0.18 -
sentiment analysis in various ranking tasks (e.g., Goldberg & Abrahams, 2018;
Zaman et al., 2021).
The models' ranking quality was evaluated based on the top
50, 100, 150, and 200 ranked reviews as pre-determined cutoffs. This study
utilized the number of true positive reviews at the top-ranked reviews and normalized
discounted cumulative gain (nDCG) as the performance metrics to evaluate the
ranking quality. Table 4.6 presents the results at various pre-determined cutoffs for
each control issue type. As shown in Table 4.6 Panel A, GPT2
offers the best performance regarding the number of true relevant reviews and nDCG
at different cutoff points, consistent with findings in Table 4.3 Panel A. These results
suggest that the GPT2 model identified the highest number of relevant reviews at the
top of the ranking list and effectively prioritized them for control issues relating to
Principle 13-15. Smoke term analysis (column: 'Unigram') followed closely as the
second-best performing model.
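For reference, nDCG at a cutoff k with binary relevance (1 if the ranked review truly mentions the control issue, else 0) can be computed as follows; the example relevance vector is hypothetical.

```python
# nDCG@k with binary relevance; the ranked relevance vector is hypothetical.
import math

def ndcg_at_k(relevance, k):
    """relevance: 0/1 list in ranked order, best-ranked review first."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(relevance[:k]))
    ideal = sorted(relevance, reverse=True)[:k]  # best possible ordering
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

ranked_relevance = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
print(f"true positives in top 5: {sum(ranked_relevance[:5])}")
print(f"nDCG@5 = {ndcg_at_k(ranked_relevance, 5):.3f}")
```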
Table 4.6, Panel B provides an evaluation of the ranking
performance of the models in identifying relevant reviews related to Principle 3 across
varying cutoff thresholds. Notably, while BERT identified the highest number of
reviews with mentions of control issues relating to Principle 3 in the top 50 ranked
reviews, smoke term analysis outperformed the former in terms of the raw number of
true positives and nDCG at the top 100, 150, and 200 ranked reviews. Nevertheless,
the integration of Word2Vec with BERT resulted in a slightly higher number of true
positive records compared to the smoke term analysis, indicating the effectiveness of
the proposed combination strategy.
Table 4.6 Ranking quality evaluation: number of true positive reviews and normalized
discounted cumulative gains (nDCG) in top-ranked reviews

Panel A: Principle 13-15
Cutoff | Metric | GPT2 | GloVe | Random forest | GPT2+DT | Unigram
Top 50 | True positive | 45 | 29 | 31 | - | 41
Top 50 | nDCG | 0.93 | 0.62 | 0.63 | - | 0.87
Top 50 | Extra true positive (Total) | - | - | - | 4 (49) | -
Top 100 | True positive | 60 | 45 | 46 | - | 49
Top 100 | nDCG | 0.88 | 0.63 | 0.63 | - | 0.76
Top 100 | Extra true positive (Total) | - | - | - | 4 (64) | -
Top 150 | True positive | 61 | 54 | 55 | - | 53
Top 150 | nDCG | 0.89 | 0.72 | 0.72 | - | 0.79
Top 150 | Extra true positive (Total) | - | - | - | 3 (64) | -
Top 200 | True positive | 63 | 56 | 57 | - | 60
Top 200 | nDCG | 0.91 | 0.74 | 0.75 | - | 0.85
Top 200 | Extra true positive (Total) | - | - | - | 3 (66) | -

Panel B: Principle 3
Cutoff | Metric | BERT | Word2Vec | GloVe | BERT+W2V | Unigram
Top 50 | True positive | 11 | 7 | 9 | - | 10
Top 50 | nDCG | 0.32 | 0.12 | 0.24 | - | 0.28
Top 50 | Extra true positive (Total) | - | - | - | 4 (15) | -
Top 100 | True positive | 16 | 8 | 12 | - | 18
Top 100 | nDCG | 0.41 | 0.15 | 0.29 | - | 0.41
Top 100 | Extra true positive (Total) | - | - | - | 2 (18) | -
Top 150 | True positive | 18 | 14 | 14 | - | 22
Top 150 | nDCG | 0.44 | 0.25 | 0.33 | - | 0.48
Top 150 | Extra true positive (Total) | - | - | - | 5 (23) | -
Top 200 | True positive | 19 | 17 | 16 | - | 24
Top 200 | nDCG | 0.44 | 0.30 | 0.35 | - | 0.51
Top 200 | Extra true positive (Total) | - | - | - | 6 (25) | -
Fig. 4.2 Accumulated true positives in top-ranked reviews for Principle 13-15 (x-axis: rank of reviews; y-axis: accumulated true positives)

Fig. 4.3 Accumulated true positives in top-ranked reviews for Principle 3 (x-axis: rank of reviews; y-axis: accumulated true positives)
and 200. Considering performance and interpretability, smoke term analysis appears
to be the best model for the task.
18 https://www.streamlit.io/
19 https://render.com/
Fig. 4.5 illustrates the workflows of the web application. The main
functionalities, sketched in code after this list, are:

- Control issue type selection: The user can select one of the two control
issue types provided to detect potential control problems in online employee
reviews. The classification of the control issue types is based on COSO's 2013
internal control framework. (Figure 4.6)
- Machine learning model selection: The user can select a pretrained machine
learning model for prediction. An introduction to the models and their
performance is provided. (Figure 4.7)
- File upload: The user can upload a CSV file of textual narratives. (Figure 4.8)
- Text analysis: Once the user uploads the file, text classification and ranking
are performed automatically.
- Outputs preview and download: The user can preview the outputs of the text
analysis and download them in CSV format. (Figure 4.8)
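A minimal Streamlit skeleton of this flow might look as follows; the widget layout, model list, and the score_reviews placeholder are hypothetical simplifications, not the deployed application's code.

```python
# Hypothetical Streamlit skeleton of the app's main flow (simplified sketch).
import pandas as pd
import streamlit as st

def score_reviews(texts, issue, model):
    # Placeholder scorer: the deployed app invokes the selected pretrained
    # classifier here (e.g., GPT2, decision tree, or a smoke term dictionary).
    return [len(str(t)) % 100 / 100 for t in texts]

st.title("Accounting Control Issue Detector")

issue = st.selectbox("Control issue type (COSO 2013)",
                     ["Principle 3", "Principle 13-15"])
model = st.selectbox("Prediction model", ["GPT2", "Decision tree", "Smoke terms"])

uploaded = st.file_uploader("Upload a CSV file of employee reviews", type="csv")
if uploaded is not None:
    reviews = pd.read_csv(uploaded)
    reviews["score"] = score_reviews(reviews.iloc[:, 0], issue, model)
    ranked = reviews.sort_values("score", ascending=False)  # rank by score
    st.dataframe(ranked.head(20))  # preview the top-ranked reviews
    st.download_button("Download results", ranked.to_csv(index=False),
                       file_name="predictions.csv")
```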
Figure 4.6: Main Page Layout for Selecting Control Issue Type
4.4 Conclusion
REFERENCES
Articles
Abrahams, A. S., Jiao, J., Fan, W., Wang, G. A., & Zhang, Z. (2013). What's buzzing
in the blizzard of buzz? Automotive component isolation in social media
postings. Decision Support Systems, 55(4), 871-882.
Abrahams, A. S., Jiao, J., Wang, G. A., & Fan, W. (2012). Vehicle defect discovery
from social media. Decision Support Systems, 54(1), 87-97.
ACFE. (2020). 2020 global study on occupational fraud and abuse.
https://acfepublic.s3-us-west-2.amazonaws.com/2020-Report-to-the-
Nations.pdf
Alden, M. E., Bryan, D. M., Lessley, B. J., & Tripathy, A. (2012). Detection of
financial statement fraud using evolutionary algorithms. Journal of Emerging
Technologies in Accounting, 9(1), 71-94.
Ashbaugh-Skaife, H., Collins, D. W., & Kinney Jr, W. R. (2007). The discovery and
reporting of internal control deficiencies prior to SOX-mandated audits.
Journal of Accounting and Economics, 44(1-2), 166-192.
Aswani, J., & Fiordelisi, F. (2020). Tournament Culture and Corporate Misconduct:
Evidence using Machine Learning. Available at SSRN.
Bao, Y., Ke, B., Li, B., Yu, Y. J., & Zhang, J. (2020). Detecting accounting fraud in
publicly traded US firms using a machine learning approach. Journal of
Accounting Research, 58(1), 199-235.
Bardhan, I., Lin, S., & Wu, S.-L. (2015). The quality of internal control over financial
reporting in family firms. Accounting Horizons, 29(1), 41-60.
Batista, G. E., & Monard, M. C. (2003). An analysis of four missing data treatment
methods for supervised learning. Applied Artificial Intelligence, 17(5-6), 519-
533.
Beneish, M. D. (1997). Detecting GAAP violation: Implications for assessing
earnings management among firms with extreme financial performance.
Journal of Accounting and Public Policy, 16(3), 271-309.
Beneish, M. D. (1999). The detection of earnings manipulation. Financial Analysts
Journal, 55(5), 24-36.
Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of
machine learning algorithms. Pattern Recognition, 30(7), 1145-1159.
Brahma, A., Goldberg, D. M., Zaman, N., & Aloiso, M. (2021). Automated
mortgage origination delay detection from textual conversations. Decision
Support Systems, 140.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
Calderon, T. G., Song, H., & Wang, L. (2016). Audit deficiencies related to internal
control: An analysis of PCAOB inspection reports. The CPA Journal, 86(2),
32.
Campbell, D., & Shang, R. (2021). Tone at the Bottom: Measuring Corporate
Misconduct Risk from the Text of Employee Reviews. Management Science,
68(9), 7034-7053.
Campbell, S., Li, Y., Yu, J., & Zhang, Z. (2016). The impact of occupational
community on the quality of internal control. Journal of Business Ethics,
139(2), 271-285.
Cao, S., Jiang, W., Yang, B., & Zhang, A. L. (2020). How to talk when a machine is
listening: Corporate disclosure in the age of AI (No. w27950). National
Bureau of Economic Research.
Cavicchini, A., Ferraro, F., & Samila, S. (2021). Under pressure: Culture and
structure as antecedents of organizational misconduct. Academy of
Management Proceedings (Vol. 2021, No. 1, p. 16077). Briarcliff Manor, NY
10510: Academy of Management.
Cecchini, M., Aytug, H., Koehler, G. J., & Pathak, P. (2010a). Detecting management
fraud in public companies. Management Science, 56(7), 1146-1160.
Cecchini, M., Aytug, H., Koehler, G. J., & Pathak, P. (2010b). Making words work:
Using financial text as a predictor of financial events. Decision Support
Systems, 50(1), 164-175.
Chen, H., Grant-Muller, S., Mussone, L., & Montgomery, F. (2001). A study of
hybrid neural network approaches and the effects of missing data on traffic
forecasting. Neural Computing & Applications, 10(3), 277-286.
Cheng, S., Felix, R., & Indjejikian, R. (2019). Spillover effects of internal
control weakness disclosures: The role of audit committees and board
connections. Contemporary Accounting Research, 36(2), 934-957.
Chu, Y., Kaushik, A. C., Wang, X., Wang, W., Zhang, Y., Shan, X., Salahub, D. R.,
Xiong, Y., & Wei, D.-Q. (2021). DTI-CDF: a cascade deep forest model
towards the prediction of drug-target interactions based on hybrid features.
Briefings in Bioinformatics, 22(1), 451-462.
Coates, J. C., & Srinivasan, S. (2014). SOX after ten years: A multidisciplinary
review. Accounting Horizons, 28(3), 627-671.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and
Psychological Measurement, 20(1), 37-46.
Corritore, M., Goldberg, A., & Srivastava, S. B. (2020). The new analytics of culture.
Harvard Business Review, 98(1), 76-83.
Cortes, C., & Vapnik, V. J. M. l. (1995). Support-vector networks. 20, 273-297.
Craja, P., Kim, A., & Lessmann, S. (2020). Deep learning for detecting financial
statement fraud. Decision Support Systems, 139, 113421.
Dechow, P. M., Ge, W., Larson, C. R., & Sloan, R. G. (2011). Predicting material
accounting misstatements. Contemporary Accounting Research, 28(1), 17-82.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep
bidirectional transformers for language understanding. arXiv preprint
arXiv:1810.04805.
Donelson, D. C., Ege, M. S., & McInnis, J. M. (2017). Internal control weaknesses
and financial reporting fraud. Auditing: A Journal of Practice & Theory,
36(3), 45-69.
Donelson, D. C., Kartapanis, A., McInnis, J. M., & Yust, C. G. (2021). Measuring
accounting fraud and irregularities using public and private enforcement.
Goel, S., & Uzuner, O. (2016). Do sentiments matter in fraud detection? Estimating
semantic orientation of annual reports. Intelligent Systems in Accounting,
Finance and Management, 23(3), 215-239.
Goldberg, D. M., & Abrahams, A. S. J. D. S. S. (2018). A Tabu search heuristic for
smoke term curation in safety defect discovery. 105, 52-65.
Goldberg, D. M., Zaman, N., Brahma, A., & Aloiso, M. (2022). Are mortgage loan
closing delay risks predictable? A predictive analysis using text mining on
discussion threads. Journal of the Association for Information Science and
Technology, 73(3), 419-437.
Green, T. C., Huang, R., Wen, Q., & Zhou, D. (2019). Crowdsourced employer
reviews and stock returns. Journal of Financial Economics, 134(1), 236-251.
Guiso, L., Sapienza, P., & Zingales, L. (2015). The value of corporate culture.
Journal of Financial Economics, 117(1), 60-76.
Hajek, P., & Henriques, R. (2017). Mining corporate annual reports for intelligent
detection of financial statement fraud–A comparative study of machine
learning methods. Knowledge-Based Systems, 128, 139-152.
Hales, J., Moon Jr, J. R., & Swenson, L. A. (2018). A new era of voluntary
disclosure? Empirical evidence on how employee postings on social media
relate to future corporate disclosures. Accounting, Organizations and Society,
68, 88-108.
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions
on knowledge and data engineering, 21(9), 1263-1284.
Hooghiemstra, R., Hermes, N., & Emanuels, J. (2015). National culture and internal
control disclosures: A cross‐country analysis. Corporate Governance: An
International Review, 23(4), 357-377.
Hoogs, B., Kiehl, T., Lacomb, C., & Senturk, D. (2007). A genetic algorithm
approach to detecting temporal patterns indicative of financial statement fraud.
Intelligent Systems in Accounting, Finance & Management: International
Journal, 15(1‐2), 41-56.
Howell, D. C. (2007). The treatment of missing data. The Sage handbook of social
science methodology, 208-224.
Huang, S., & Yang, X. (2010). Internal control report, quality of financial reports and
information asymmetry: Empirical evidence from listed companies in
Shanghai Stock Exchange. Journal of Finance and Economics, 36, 81-91.
Humpherys, S. L., Moffitt, K. C., Burns, M. B., Burgoon, J. K., & Felix, W. F.
(2011). Identification of fraudulent financial statements using linguistic
credibility analysis. Decision Support Systems, 50(3), 585-594.
Järvelin, K., & Kekäläinen, J. (2002). Cumulated gain-based evaluation of IR
techniques. ACM Transactions on Information Systems (TOIS), 20(4), 422-
446.
Jerez, J. M., Molina, I., García-Laencina, P. J., Alba, E., Ribelles, N., Martín, M., &
Franco, L. (2010). Missing data imputation using statistical and machine
learning methods in a real breast cancer problem. Artificial Intelligence in
Medicine, 50(2), 105-115.
Ji, Y., Rozenbaum, O., & Welch, K. (2017). Corporate culture and financial reporting
risk: Looking through the glassdoor. Available at SSRN 2945745.
Johnstone, K., Li, C., & Rupley, K. H. (2011). Changes in corporate governance
associated with the revelation of internal control material weaknesses and their
subsequent remediation. Contemporary Accounting Research, 28(1), 331-383.
Karpoff, J. M., Koester, A., Lee, D. S., & Martin, G. S. (2017). Proxies and databases
in financial misconduct research. The Accounting Review, 92(6), 129-163.
Kim, Y. J., Baik, B., & Cho, S. (2016). Detecting financial misstatements with fraud
intention using multi-class cost-sensitive learning. Expert Systems with
Applications, 62, 32-43.
Kotsiantis, S., Koumanakos, E., Tzelepis, D., & Tampakas, V. (2006). Forecasting
fraudulent financial statements using data mining. International Journal of
Computational Intelligence, 3(2), 104-110.
Koubli, E., Palmer, D., Rowley, P., & Gottschalg, R. (2016). Inference of missing
data in photovoltaic monitoring datasets. IET Renewable Power Generation,
10(4), 434-439.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for
categorical data. biometrics, 159-174.
Li, J. (2022). The effect of employee satisfaction on effective corporate tax planning:
Evidence from Glassdoor. Advances in accounting, 57, 100597.
Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R
news, 2(3), 18-22.
Lin, C.C., Chiu, A.A., Huang, S. Y., & Yen, D. C. (2015). Detecting the financial
statement fraud: The analysis of the differences between data mining
techniques and experts’ judgments. Knowledge-Based Systems, 89, 459-470.
Lin, Y. C., Wang, Y. C., Chiou, J. R., & Huang, H. W. (2014). CEO characteristics
and internal control quality. Corporate Governance: An International Review,
22(1), 24-42.
Liu, X. Y., & Zhou, Z. H. (2006). The influence of class imbalance on cost-sensitive
learning: An empirical study. Sixth International Conference on Data Mining
(ICDM'06).
Liu, Y., Zhang, S., Ding, B., Li, X., & Wang, Y. (2018). A cascade forest approach to
application classification of mobile traces. 2018 IEEE Wireless
Communications and Networking Conference (WCNC).
Ma, C., Liu, Z., Cao, Z., Song, W., Zhang, J., & Zeng, W. (2020). Cost-sensitive deep
forest for price prediction. Pattern Recognition, 107, 107499.
Mali, N., Restrepo, F., Abrahams, A., & Ractham, P. (2022). Implementation
of MARS metrics and MARS charts for evaluating classifier exclusivity: The
comparative uniqueness of binary classifier predictions. Software Impacts, 12,
100259.
Mao, J., & Ettredge, M. (2016). Internal control deficiency disclosures among
Chinese reverse merger firms. Abacus, 52(3), 441-472.
Mao, M. Q., & Yu, Y. (2015). Analysts' cash flow forecasts, audit effort, and audit
opinions on internal control. Journal of Business Finance & Accounting, 42(5-
6), 635-664.
Mazza, T., & Azzali, S. (2015). Effects of internal audit quality on the severity and
persistence of controls deficiencies. International Journal of Auditing, 19(3),
148-165.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation
of word representations in vector space. arXiv preprint arXiv:1301.3781.
Restrepo, F., Mali, N., Abrahams, A., & Ractham, P. (2022). Formal definition
of the MARS method for quantifying the unique target class discoveries of
selected machine classifiers. F1000Research, 11, 391.
Rice, S. C., & Weber, D. P. (2012). How effective is internal control
reporting under SOX 404? Determinants of the (non‐)disclosure of existing
material weaknesses. Journal of Accounting Research, 50(3), 811-843.
Rich, K. T., Roberts, B. L., & Zhang, J. X. (2018). Linguistic tone and internal control
reporting: Evidence from municipal management discussion and analysis
disclosures. Journal of Governmental & Nonprofit Accounting, 7(1), 24-54.
Robertson, S. E. (1986). On relevance weight estimation and query
expansion. Journal of Documentation, 42(3), 182-188.
Su, L. N., Zhao, X. R., & Zhou, G. S. (2014). Do customers respond to the disclosure
of internal control weakness? Journal of Business Research, 67(7), 1508-1518.
Sun, T. (2018). The incremental informativeness of the sentiment of conference calls
for internal control material weaknesses. Journal of Emerging Technologies in
Accounting, 15(1), 11-27.
Sun, T. (2019). Applying deep learning to audit procedures: An illustrative
framework. Accounting Horizons, 33(3), 89-109.
Teoh, S. H. (2018). The promise and challenges of new datasets for accounting
research. Accounting, Organizations and Society, 68, 109-117.
Throckmorton, C. S., Mayew, W. J., Venkatachalam, M., & Collins, L. M. (2015).
Financial fraud detection using vocal, linguistic and financial cues. Decision
Support Systems, 74, 78-87.
Twala, B. (2009). An empirical comparison of techniques for handling incomplete
data using decision trees. Applied Artificial Intelligence, 23(5), 373-405.
Twala, B., & Nelwamondo, F. (2017). Enhancing the detection of financial statement
fraud through the use of missing value estimation, multivariate filter feature
selection and cost-sensitive classification. University of Johannesburg (South
Africa).
Vasarhelyi, M. A., Kogan, A., & Tuttle, B. M. (2015). Big Data in accounting: An
overview. Accounting Horizons, 29(2), 381-396.
Walker, S. (2020). A Needle Found: Machine learning does not significantly improve
corporate fraud detection beyond a simple screen on sales growth. Available at
SSRN.
Warren Jr, J. D., Moffitt, K. C., & Byrnes, P. (2015). How Big Data will change
accounting. Accounting Horizons, 29(2), 397-407.
West, J., & Bhattacharya, M. (2016). Intelligent financial fraud detection: A
comprehensive review. Computers & Security, 57, 47-66.
Winkler, M., Abrahams, A. S., Gruss, R., & Ehsani, J. P. (2016). Toy safety
surveillance from online reviews. Decision Support Systems, 90, 23-32.
Zadrozny, B., Langford, J., & Abe, N. (2003). Cost-sensitive learning by cost-
proportionate example weighting. Third IEEE International Conference on Data
Mining.
Zaman, N., Goldberg, D. M., Abrahams, A. S., & Essig, R. A. (2020). Facebook
hospital reviews: Automated service quality detection and relationships
with patient satisfaction. Decision Sciences, 52(6), 1403-1431.
Zaman, N., Goldberg, D. M., Gruss, R. J., Abrahams, A. S., Srisawas, S., Ractham,
P., & Şeref, M. M. (2021). Cross-category defect discovery from online
reviews: Supplementing sentiment with category-specific
semantics. Information Systems Frontiers, 1-21.
Zhang, Y. L., Zhou, J., Zheng, W., Feng, J., Li, L., Liu, Z., Li, M., Zhang, Z., Chen,
C., & Li, X. (2019). Distributed deep forest and its application to automatic
detection of cash-out fraud. ACM Transactions on Intelligent Systems and
Technology (TIST), 10(5), 1-19.
Zhou, Y., & Makridis, C. (2019). Firm reputation following financial misconduct:
Evidence from employee ratings. Available at SSRN 3271455.
Zhou, Z. H., & Feng, J. (2019). Deep forest. National Science Review, 6(1), 74-86.
Zhou, Z. H., & Liu, X. Y. (2005). Training cost-sensitive neural networks with
methods addressing the class imbalance problem. IEEE Transactions on
Knowledge and Data Engineering, 18(1), 63-77.
APPENDIX A
INDICATOR VARIABLES AND FREQUENCY AFFECTED BY
MISSTATEMENTS OVER THE TEST PERIOD 2003-2008
Indicator variables, ranked by the frequency with which they were affected by
misstatements (frequency and relative frequency in parentheses):
1. inc_exp_se: Equals 1 if the misstatement affected net income, and hence
shareholder equity, but could not be classified into any of the specific income,
expense, or equity accounts listed below; 0 otherwise. 169 (31.8%).
2. rev: Equals 1 if the misstatement affected revenues, 0 otherwise. 127 (23.9%).
3. asset: Equals 1 if the misstatement affected an asset account that could not be
classified into a separate individual asset account in this list, 0 otherwise.
63 (11.8%).
4. rec: Equals 1 if the misstatement affected accounts receivable, 0 otherwise.
46 (8.6%).
5. inv: Equals 1 if the misstatement affected inventory, 0 otherwise. 29 (5.5%).
6. res: Equals 1 if the misstatement affected reserves accounts, 0 otherwise.
29 (5.5%).
7. liab: Equals 1 if the misstatement affected liabilities, 0 otherwise. 26 (4.9%).
8. cogs: Equals 1 if the misstatement affected cost of goods sold, 0 otherwise.
26 (4.9%).
9. pay: Equals 1 if the misstatement affected accounts payable, 0 otherwise.
17 (3.2%).
10. mkt_sec: Equals 1 if the misstatement affected marketable securities,
0 otherwise. 0 (0.0%).
11. debt: Equals 1 if the misstatement affected allowance for bad debts,
0 otherwise. 0 (0.0%).
Total: 532 (100%).
Source: Bao et al. (2020) Panel B, Table 7.
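For readers who wish to reproduce this encoding, the short Python sketch below
shows one way such binary indicators could be constructed with pandas. It is a
minimal sketch under stated assumptions: the firm-year records, the
affected_accounts column, and the listed account tags are hypothetical
placeholders that mirror the definitions above, not the dissertation's actual
data pipeline.

import pandas as pd

# Hypothetical firm-year misstatement records (illustrative values only);
# "affected_accounts" lists the account tags a misstatement touched.
records = pd.DataFrame({
    "firm_year": ["A-2003", "B-2004", "C-2005"],
    "affected_accounts": [["rev", "rec"], ["inv"], ["rev", "cogs", "liab"]],
})

# Indicator names follow the table above (Bao et al., 2020).
INDICATORS = ["inc_exp_se", "rev", "asset", "rec", "inv", "res",
              "liab", "cogs", "pay", "mkt_sec", "debt"]

# One binary column per indicator: 1 if the misstatement affected that
# account category, 0 otherwise.
for name in INDICATORS:
    records[name] = records["affected_accounts"].apply(
        lambda accounts, n=name: int(n in accounts))

print(records[["firm_year"] + INDICATORS])

The resulting 0/1 columns could then be joined to the financial variables that
serve as model inputs.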
APPENDIX B
THE SAMPLING FIRMS IN THE DATA EXPLORATION PHASE
APPENDIX C
Coding protocol for mentions of
Principle 3: “Organizational structure, authority, and responsibility”
1) Red flags: negative comments about the current organizational structure or
about changes to it (e.g., “change for the sake of change”); confusion about
roles and responsibilities in the new structure; insufficient support for
navigating the matrix structure; inadequate or inconsistent
authorization/empowerment; a perception of role ambiguity (e.g., no clear job
description); duplication of effort; lack of a work plan (e.g., work assigned
at short notice); and so forth.
2) No mention: If the review does not contain content related to the above, the
label should be “No mention”.
APPENDIX D
Coding protocol for mentions of
Principle 4: “Commitment to Competence”
Employee 13: "...Metrics are not accurate and are a source of stress
when trying to do what's right, but what's right isn't accounted for in
the metrics which are used for performance evaluations."
2) No mention: If the review does not contain content related to the above, the
label should be “No mention”.
APPENDIX E
Coding protocol for mentions of
Principle 11, 13-15: “Control over Technology, Information, &
Communication”
3) No mention: If the review does not contain content related to the above, the
label should be “No mention”.
BIOGRAPHY
Publications
Huang, L., Abrahams, A., & Ractham, P. (2022). Enhanced financial fraud
detection using cost‐sensitive cascade forest with missing value
imputation. Intelligent Systems in Accounting, Finance and Management,
29(3), 133-155.