
MACHINE LEARNING APPROACHES TO DETECT
FINANCIAL STATEMENT FRAUDS AND
ACCOUNTING CONTROL ISSUES

BY

LUKUI HUANG

A DISSERTATION SUBMITTED IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY (BUSINESS ADMINISTRATION)
FACULTY OF COMMERCE AND ACCOUNTANCY
THAMMASAT UNIVERSITY
ACADEMIC YEAR 2022



THAMMASAT UNIVERSITY
FACULTY OF COMMERCE AND ACCOUNTANCY

DISSERTATION

BY

LUKUI HUANG

ENTITLED

MACHINE LEARNING APPROACHES TO DETECT FINANCIAL STATEMENT FRAUDS AND ACCOUNTING CONTROL ISSUES

was approved as partial fulfillment of the requirements for

the degree of Doctor of Philosophy (Business Administration)

on July 6, 2023

Chairman
(Assistant Professor Richard Gruss, Ph.D.)

Member and Advisor


(Associate Professor Peter Ractham, Ph.D.)

Member and Co-Advisor


(Associate Professor Alan Abrahams, Ph.D.)
Member
(Professor Charturong Tantibundhit, Ph.D.)
Member
(Associate Professor Juthamon Sithipolvanichgul, Ph.D.)
Dean
(Associate Professor Somchai Supattarakul, Ph.D.)

Dissertation Title MACHINE LEARNING APPROACHES TO DETECT FINANCIAL STATEMENT FRAUDS AND ACCOUNTING CONTROL ISSUES
Author Lukui Huang
Degree Doctor of Philosophy (Business Administration)
Faculty Faculty of Commerce and Accountancy
University Thammasat University
Dissertation Advisor Associate Professor Peter Ractham
Dissertation Co-Advisor Associate Professor Alan Abrahams
Academic Year 2022

ABSTRACT

Financial statement fraud is a serious concern for investors and other stakeholders. According to the Association of Certified Fraud Examiners' 2022 report on global occupational fraud, organizations lose an estimated 5% of revenue to fraud each year. Financial statement fraud accounts for only 9% of cases but is the costliest type of occupational fraud, with a median loss of $593,000 in 2022. Traditionally, a company relies on internal control activities to prevent and detect fraud. Firms with a failure in internal control are more likely to indulge in fraud and have a higher probability of filing for bankruptcy. The Sarbanes-Oxley Act (SOX) requires listed companies to disclose material weaknesses in internal controls over financial reporting. This requirement is expected to provide investors with an early warning signal regarding the reliability of reported financial information. However, regulators and practitioners have expressed concerns about unreported control weaknesses and the reliability of SOX 404 reports, making the detection of control issues increasingly important to investor protection.
Recent artificial intelligence (AI) developments have precipitated
substantive changes in accounting education, research, and practice. As an
application or subset of AI, machine learning allows machines to learn from data
without being programmed explicitly. This dissertation aims to provide solutions to two extremely difficult and related accounting problems using advanced machine learning techniques: financial statement fraud detection and accounting control issue identification.
The first study proposes a novel fraud detection model based on an
ensemble machine learning algorithm known as Cost-sensitive Cascade Forest.
The proposed fraud detection model significantly outperforms the baseline, and
the performance is further enhanced with appropriate missing data treatment.
The second study focuses on accounting control issue identification. The results demonstrate the great potential of online employee reviews in signaling accounting control issues, a data source which has yet to be explored in the literature. In the study, online employee reviews were collected, manually labeled according to COSO’s internal control framework (2013), and used to train advanced data mining models; finally, a web application based on the experimental results was developed and deployed on a cloud server.

Keywords: fraud detection, internal control, machine learning, cascade forest, missing data treatment


ACKNOWLEDGEMENTS

First and foremost, I would like to express my sincere gratitude to my advisor, Associate Professor Dr. Peter Ractham, for all his guidance and coordination, and for introducing me to my co-advisor, Associate Professor Dr. Alan Abrahams. I highly appreciate their tremendous and continuous support throughout my Ph.D. studies. I will always remember the detailed comments on every manuscript, as well as the encouragement and motivation in almost every email from Dr. Abrahams. Even though we have not met in person, they have given me more care and support than I could ask for.
I would also like to thank the rest of my dissertation committee: Assistant Professor Dr. Richard Gruss, Associate Professor Dr. Juthamon Sithipolvanichgul, and Professor Dr. Charturong Tantibundhit, for their valuable comments that truly improved my dissertation.
My special appreciation goes to Professor Dr. Monvika Phadoongsitthi and Associate Professor Dr. Juthamon Sithipolvanichgul, who helped me when I most needed it. Without them, I would not even have started this journey. For that, I could not thank them enough. As a Chinese saying goes, they are my “贵人” (benefactors).
Seven years is a long time, and many things have happened during this journey. I feel lucky that I chose Thammasat Business School and met so many people with good hearts. Many of them have accompanied me through some of the darkest times of my life. Last but not least, I would like to thank my family and my boyfriend, who have been with me and supported me no matter what happened. I hope that he can also receive his Ph.D. degree this year and that we can begin a new journey in life together.

Lukui Huang


TABLE OF CONTENTS

Page
ABSTRACT (1)

ACKNOWLEDGEMENTS (3)

LIST OF TABLES (7)

LIST OF FIGURES (8)

CHAPTER 1 INTRODUCTION 1

CHAPTER 2 BACKGROUND AND LITERATURE REVIEW 3

2.1 Study 1: Financial fraud detection 3


2.1.1 Background 3
2.1.2 Literature review 4
2.2 Study 2: Accounting control issues identification 7
2.2.1 Background 7
2.2.2 Literature review 9

CHAPTER 3 STUDY 1: FINANCIAL FRAUD DETECTION 12

3.1 The proposed fraud detection model 12


3.1.1 An introduction to Deep Forest and Cascade Forest 12
3.1.2 Cost-sensitive Cascade Forest 13
3.2 The sample and data 14
3.2.1 Financial variables 14
3.2.2 Sample selection procedure 15
3.2.3 Missing value treatments 16


3.2.4 Labeling 17
3.3 Experimental results 18
3.3.1 Experimental setup 18
3.3.2 Evaluating metrics 20
3.3.3 Experimental results and analysis 22
3.4 Supplement analysis 25
3.4.1 The size effect of the training sets 25
3.4.2 Feature importance of cost-sensitive cascade forest 26
3.4.3 The effect of more financial variables 27
3.5 Conclusion 30
3.5.1 Implications for Research 31
3.5.2 Implications for Practice 31
3.5.3 Limitations 32
3.5.4 Future Work 32

CHAPTER 4 STUDY 2: ACCOUNTING CONTROL ISSUES IDENTIFICATION 34

4.1 Data exploration 35


4.2 A dataset of online employee reviews for identifying accounting control issues 38
4.3 A web application for accounting control issues identification 42
4.3.1 Text classification methods 43
4.3.2 Evaluation metrics 46
4.3.3 The selection of built-in prediction models 47
4.3.4 Software description 57
4.4 Conclusion 63

REFERENCES 65


APPENDICES

APPENDIX A 75
APPENDIX B 76
APPENDIX C 77
APPENDIX D 78
APPENDIX E 79

BIOGRAPHY 81


LIST OF TABLES

Tables Page
3.1 Descriptive statistics of selected financial variables 15
3.2 Sample selection 17
3.3 Distribution of fraud firms over 1991-2014 18
3.4 Experimental results 23
3.5 Feature importance of the top-10 financial variables and the related accounts affected by the detected frauds 27
3.6 Experimental results using 187 variables 29
4.1 Prevalence of accounting control issues in a random sample of employee reviews 37
4.2 Top 20 unigrams retrieved by Correlation Coefficient score 45
4.3 Experimental results 48
4.4 MARS Metric for Principle 13-15 50
4.5 MARS Metric for Principle 3 51
4.6 Ranking quality evaluation 54


LIST OF FIGURES

Figures Page
3.1 Illustration of the workflow for the main experiments 20
3.2 Effect of training size on models’ AUC with missing value treatments 26
3.3 AUC vs Time per Layer (in seconds) for various feature set sizes 30
4.1 Illustration of the workflow for the dataset 39
4.2 Accumulated true positive in top ranked reviews for Principle 13-15 55
4.3 Accumulated true positive in top ranked reviews for Principle 3 55
4.4 Web application architecture 58
4.5 Web application components 58
4.6 Main Page layout and layout for selecting control issue type 60
4.7 Layout for selecting a machine learning prediction model 61
4.8 Layout for file upload and prediction preview 62


CHAPTER 1
INTRODUCTION

Fraudulent financial statements result from manipulations of financial data, which can lead to disastrous consequences for investors. According to the global
survey conducted by the Association of Certified Fraud Examiners (ACFE) in 2022,
fraud costs companies an estimated 5% of their annual revenue. Among various types
of occupational fraud, financial statement fraud was found to be the least common (9%
of cases) but the most expensive ($593,000 median loss). Financial statement fraud is
extremely challenging to detect. Traditional methods for preventing and detecting
fraud include internal control activities, internal and external audits, risk management
systems, whistle-blowing hotlines, and investigation departments. However, these
methods are typically expensive and time-consuming, and can be imprecise.
Among these traditional methods, internal control systems are the most
fundamental mechanism that assures the reliability of financial and accounting
information. They are defined as the accounting and auditing processes used to ensure
the integrity of financial reporting and regulatory compliance. An internal control
breakdown is often associated with unreliable financial reporting, financial
restatement, or even financial fraud which may bring tremendous damages to relevant
stakeholders (e.g., investors). To protect investors, Section 404 of the Sarbanes-Oxley Act requires all financial reports to include an Internal Controls Report (the SOX 404 Report), which contains opinions on the effectiveness of internal control from both the management team and the external auditors. For external stakeholders, the SOX 404 Report is the only regular source of information on the effectiveness of a firm's internal control. Its reliability, nevertheless, has been questioned by regulators, practitioners, and researchers (e.g., Cheng et al., 2019; Plumlee & Yohn, 2010; Rice & Weber, 2012), and there appear to be no substantive consequences for failing to report existing internal control weaknesses (Rice et al., 2015), making the information asymmetry problem dangerously severe.


The intelligent use of artificial intelligence (AI) has precipitated substantive changes in accounting education, research, and practice. The objective of
this dissertation is to provide intelligent solutions to the problems of financial
statements fraud detection and accounting control issues identification using machine
learning techniques.
The first study (Study 1) focuses on financial fraud detection. A new
fraud detection model, Cost-sensitive Cascade Forest, is proposed with appropriate missing data treatments for performance enhancement. Study 1 of this dissertation has already been published:
Huang, L., Abrahams, A., & Ractham, P. (2022). Enhanced financial fraud detection using cost-sensitive cascade forest with missing value imputation. Intelligent Systems in Accounting, Finance and Management, 29(3), 133-155.
The second study (Study 2) provides a solution to identify accounting
control issues from online employee reviews. Accounting control issues were
categorized and coded according to the Internal Control Integrated Framework (2013)
issued by the Committee of Sponsoring Organizations of the Treadway Commission,
the most recognized implementation guidance for internal control practice worldwide.
Selected natural language processing and text mining techniques were used to rank the reviews for control issue identification. A web application was developed
with built-in prediction models for users to explore and detect accounting control
issues.
The rest of the dissertation is organized as follows. Chapter 2 introduces the background and reviews the literature on financial fraud detection and internal control identification in accounting research. The experimental setting and results of Study 1 are presented in Chapter 3, and Chapter 4 presents the results of Study 2.


CHAPTER 2
BACKGROUND AND LITERATURE REVIEW

2.1 Study 1: Financial fraud detection

2.1.1 Background
Unlike unintentional errors, firms may intentionally distort their financial statements to appear more attractive. Such fraud is deliberately obscured and thus challenging to detect, even for professionals (Earley, 2015; Warren Jr et al., 2015). A report by the world's largest anti-fraud organization, the Association of Certified Fraud Examiners (ACFE), found that financial fraud caused over $3.6 billion in damage from January 2018 to September 2019. Only 15% of fraud cases were detected by internal auditors and 4% by external auditors (ACFE, 2020). To
improve detection, auditing standards (e.g., the Statement of Auditing Standards
No.99: Auditors' Responsibility for Fraud Detection) require more proactive auditing.
Data mining, which uncovers insights from large data, has advanced capabilities to
classify and predict. Given data mining's popularity and availability, accounting firms
have used it in decision support systems to provide early warnings of financial
statement fraud (Hajek & Henriques, 2017).
Prior research has investigated the utility of various data mining
techniques to aid auditors in fraud detection (Ngai et al., 2011; West & Bhattacharya,
2016). These include decision trees (Kotsiantis et al., 2006), support vector machines (Cecchini et al., 2010b), neural networks (e.g., Fanning & Cogger, 1998; Lin et al., 2015), evolutionary algorithms (e.g., Alden et al., 2012; Hoogs et al., 2007), ensemble
models (e.g., Bao et al., 2020; Hajek & Henriques, 2017), and text mining (e.g., Goel
& Gangolly, 2012; Goel & Uzuner, 2016; Throckmorton et al., 2015). Although some
studies used financial, linguistic, and textual data (e.g., Craja et al., 2020; Hajek &
Henriques, 2017), findings on whether linguistic or textual data provide extra
information are inconsistent and depend on the data selected. However, studies using
financial data consistently show that financial data, especially with advanced data
mining, provides useful information for fraud detection.


Study 1 develops a new model to predict financial statement fraud in U.S. firms. Extending Deep Forest (Zhou & Feng, 2019), an ensemble learning algorithm, this study proposes a cost-sensitive model that accounts for the different misclassification costs in fraud investigation. Ensemble learning methods are state-of-the-art machine learning algorithms used in recent fraud studies (e.g., Bao et al., 2020; Hajek & Henriques, 2017). Deep Forest has a deep structure like neural networks and impressive performance in complex tasks. It has applications in various domains including cash-out fraud (Zhang et al., 2019), drug-target interactions (Chu et al., 2021), price prediction (Ma et al., 2020), and mobile network traffic classification (Liu et al., 2018). The proposed modified Deep Forest is used as the fraud detection method. Experimental results show the proposed model outperforms the baseline, especially when missing data is imputed with zeros.

2.1.2 Literature review


Given that fraud instances are relatively rare, building a financial fraud detection model is challenging. Research on predicting financial statement fraud with data analytics involves at least three key model decisions besides the choice of the underlying learning algorithm: (1) the selection of predictors (features) to use (ratios vs. raw data); (2) the approach to handling the extreme class imbalance between fraud and non-fraud cases (sampling vs. weighting); and (3) the treatment of missing values.
2.1.2.1 Input attribute decisions: ratios vs raw data
Early studies (e.g., Beneish, 1997, 1999) calculated
multivariate scores using expert-identified financial statement variables for input
feature selection. Dechow et al. (2011) and Perols (2011) documented the financial statement fraud predictors in the literature, revealing that these predictors are often financial ratios identified by experts based on theories of earnings management or corporate governance (Perols et al., 2017).
Instead of limiting fraud predictors to specific formats, raw financial data obtained by decomposing the fraud predictors have been used as alternative model inputs (e.g., Bao et al., 2020; Cecchini et al., 2010a). Advanced data mining
techniques and models using dynamic raw financial variable combinations can also
effectively identify fraud. Cecchini et al. (2010a) used a support vector machine with


a nonlinear kernel function mapping raw financial variables to detect financial statement fraud. Bao et al. (2020) used RUSBoost, a data mining method, and
achieved the best results with raw rather than traditional financial ratio variables,
suggesting the choice of variable transformations depends on the data mining method.
2.1.2.2 Class imbalance decision: sampling vs. weighting
Sampling methods, particularly undersampling, are
commonly utilized in financial fraud research to address the extreme class imbalance
issue in fraud datasets (e.g., Craja et al., 2020; Ravisankar et al., 2011).
Undersampling involves matching fraudulent observations with non-fraudulent
examples based on specific criteria. However, this approach has limitations in terms
of reflecting real-world scenarios and may struggle with generalization due to the
challenges of consistently identifying appropriate matching records. To overcome this
limitation, alternative resampling strategies were employed using extensive databases.
Dechow et al. (2011) propose the F-Score for fraud detection, employing logistic
regression and verifying their model using data from all non-financial publicly traded
firms in the United States. Perols (2011) generates a more balanced training sample
by randomly selecting non-fraud observations and comparing multiple financial fraud
detection algorithms. Building upon Perols' work, Perols et al. (2017) introduce
Multi-Subset Observation Undersampling (OU) as a method to address the class
imbalance issue in fraud datasets. OU involves randomly collecting subsets of non-
fraud samples at various degrees of sub-sample sizes for training the model. They
systematically evaluate two additional undersampling strategies and demonstrate that
OU consistently improves prediction performance. In a recent study, Bao et al. (2020)
further improve the fraud prediction performance for U.S. listed firms by applying
RUSBoost, an ensemble model that incorporates an in-built random sampling
function for the majority class.
Sampling is not the sole solution for addressing class
imbalance in fraud detection. An alternative approach is cost-sensitive weighting,
which has gained recognition in the data mining literature as a viable alternative to
sampling methods (for a review, see Galar et al., 2011; He & Garcia, 2009). Cost-
sensitive learning, introduced by Elkan (2001), is frequently utilized in imbalanced
learning scenarios and biases the classifier towards the minority class by assuming higher misclassification costs for that class. Empirical studies in various domains have
demonstrated the superiority of cost-sensitive learning techniques over sampling
methods (e.g., Liu & Zhou, 2006; Zhou & Liu, 2005).
In the field of financial fraud detection, researchers have examined algorithms with cost-sensitive concepts, such as weighted SVM, cost-sensitive Naïve Bayes, and KNN, to address high class imbalance (Moepya et al.,
2014; Twala & Nelwamondo, 2017). Additionally, the financial misstatement
prediction model proposed by Kim et al. (2016) employed MetaCost, a data
preprocessing technique to reflect asymmetric misclassification costs, to minimize
cost during the training. However, it is worth noting that the use of cost-sensitive
classification for imbalanced learning remains limited in the financial statement fraud
literature.
2.1.2.3 Dealing with missing data
Missing data is a common challenge in fraud research.
Research from the data mining literature offers various solutions to address this issue
(Feelders, 1999; Twala, 2009). Surprisingly, only a few existing fraud studies have
thoroughly discussed this issue and the impact of different missing value treatment on
these experimental results largely remain unknown. In early fraud studies like
Cecchini et al. (2010a), Dechow et al. (2011), and Perols (2011), complete case
analysis is employed to handle missing data. This method involves excluding entire
firm-observations that contain missing values. Other approaches in fraud research
adopt statistical techniques such as mean and zero imputation, or machine learning
techniques like support vector machines to handle missing values. For instance, Perols
et al. (2017) replace missing values with global means during data preprocessing for
fraud prediction. Walker (2020) imputes missing values as zero, and partly the same approach is employed in Bao et al. (2020) before inputting the data into their
proposed fraud detection models. Craja et al. (2020) and Hajek & Henriques (2017)
employ various machine learning algorithms to estimate missing values based on the
available data.
Complete case analysis, which excludes the entire records
with missing values, is a crude approach to handling missing data. It can introduce
bias and discard useful information. Alternatively, employing statistical techniques that substitute missing values with zero, mean, or median can enhance performance.
However, the improvement depends on input data characteristics like variance and
skewness. More advanced machine learning techniques impute missing values by
estimating them from available data (Batista & Monard, 2003; Howell, 2007). In
short, the benefits of different missing data treatments depend on the problem domain,
data patterns, and dataset size (Allison, 2001).
2.1.2.4 Summary and discussion of prior research
In summary, the selection and structure of fraud predictors should be determined by the prediction algorithm employed. Though undersampling is commonly used to handle the class imbalance issue in fraud studies, it often lacks out-of-sample testing and thus raises generalizability concerns. Research using large datasets to enable realistic out-of-sample testing is lacking. Work on the impact of various missing value treatments on fraud prediction models is also very limited in the fraud literature, especially for cost-sensitive methods.

2.2 Study 2: Accounting control issues identification

2.2.1 Background
Internal control has long been recognized as a fundamental process
assuring the reliability of accounting and financial information. The Public Company
Accounting Oversight Board (PCAOB) defines a material weakness in internal
control (ICMW) as, “a deficiency, or a combination of deficiencies, in internal control
over financial reporting, such that there is a reasonable possibility that a material
misstatement of the company’s annual or interim financial statements will not be
prevented or detected on a timely basis.” Weak internal control is linked to a higher
risk of fraudulent activities (e.g., employees misappropriating assets) as it provides
opportunities for managers or other employees to override the control for their self-
interests (Donelson et al., 2017). Disclosure of internal control weakness may damage
investors’ confidence in corporate governance (Johnstone et al., 2011). It may raise
customers’ concerns about the product/service’s quality, eventually leading to a
decline in sales growth (Su et al., 2014). In addition, firms are also more likely to file
for bankruptcy if material weaknesses in their internal controls are found or disclosed (Doyle et al., 2007). Therefore, responding to the demand for high internal control quality, the Sarbanes-Oxley Act of 2002 (SOX) mandates and specifies management’s (Section 404(a)) and auditors’ (Section 404(b)) responsibilities in reporting their evaluations of the effectiveness of internal controls over financial reporting (ICFR).
To form an opinion of the effectiveness of ICFR, auditors are also
required to plan the audit and collect an appropriate amount of audit evidence
according to the preliminary evaluation of their clients’ internal control (PCAOB
2007). The extent of substantive testing of financial statement accounts and
disclosures often relies on the effectiveness of ICFR. Therefore, the evaluation of
ICFR is essential for an audit. However, regulators (e.g., the PCAOB) have expressed great concern about ICFR audit deficiencies in practice (Franzel, 2015), as an upward trend is observed in the percentage of firms with ineffective ICFR and material weaknesses (Calderon et al., 2016), and ICFR audit deficiencies are among the most frequent findings in regulators' annual inspections (Franzel, 2015). From a practical
perspective, therefore, there is a high demand for a more effective evaluation of ICFR
and identification of ICMW for investors, auditors, regulators, and other stakeholders.
The importance of control effectiveness can also be inferred from
the extensive literature regarding internal controls. Various ICMW prediction models
using firm characteristics, financial indicators, and external factors have been
proposed in existing literature (e.g., Ashbaugh-Skaife et al., 2007). Taking advantage
of the latest development in data mining techniques, researchers have begun to turn to
new types of data sources (e.g., conference calls) for incremental information to
enhance ICMW identification (e.g., Nasir et al., 2021; Rich et al., 2018). However,
these research studies are all based on corporate disclosure (e.g., financial reporting,
conference calls). The main issue of using corporate disclosure, voluntarily or
mandatory, to infer or predict internal control implementations is that this information
is provided by the firms, making disclosure information subject to the possibility of
being managed or even manipulated. Data sources that the firm cannot manage may
tell a different story. Therefore, these alternative data sources –outside the direct
management of the firm- are valuable to mine. Study 2 focuses on online employee

Ref. code: 25655902320018OLX


9

reviews as a new type of data source to infer accounting control practices in a firm,
which has not been examined in the literature yet.

2.2.2 Literature review


Predicting material weaknesses can be achieved by developing a
model using IC determinants as the predictors (e.g., Ashbaugh-Skaife et al., 2007).
Accounting literature has well documented the internal and external determinants, as
well as the economic consequences, of IC quality (e.g., Coates & Srinivasan, 2014;
Doyle et al., 2007). In general, these determinants include board characteristics (e.g.,
Campbell et al., 2016; Lin et al., 2014), ownership structure (Bardhan et al., 2015),
audit-related variables (e.g., Mazza & Azzali, 2015), regulatory environment (e.g.,
Mazza & Azzali, 2015), national culture (e.g., Hooghiemstra et al., 2015), other firm characteristics (e.g., Mao & Ettredge, 2016), and business environment variables (e.g., Mao & Yu, 2015).
With the availability of advanced data analytics, recent studies have
shown that the traditional ICMW detection models can be enhanced by adding
indicators extracted from Big Data, which is usually defined by its characteristics in
terms of volume, velocity, variety, veracity and value (Rich et al., 2018; Vasarhelyi et
al., 2015). For example, Rich et al. (2018) study the textual content of MD&A
disclosures in corporate annual reports. They show that Management Discussion and
Analysis (MD&A) contains important information about reporting future internal
control weaknesses. Sun (2018) conducts a textual analysis of conference calls and
provides evidence that sentiment factors in the conference calls’ transcripts have
predictive value for future ICMW. Nasir et al. (2021) build an early warning system
to detect internal control material weakness using the design science research
paradigm.
High quality ICFR audits can be expensive due to information
asymmetry (Huang & Yang, 2010). Using data analytics techniques and new types of
data (e.g., textual data), auditors can extract potentially useful information from
various data sources to reduce information asymmetry and assist their decision-
making at relatively low costs. However, it appears that existing studies only focus on
textual data extracted from management communications to stakeholders, such as conference calls and MD&A. These studies assume that management is aware of the
ICMW and may communicate it with stakeholders unconsciously or deliberately.
While this assumption is plausible, under certain circumstances (such as fraud),
management may be motivated to conceal their private information. In such cases,
management, aware of machine processors or other advanced data analytics, may
change their disclosure strategy to manipulate their audience. For example, Cao et al.
(2020) show that when managers know their disclosure readers are machines, they will manage the textual sentiment and audio emotion in their disclosures. Thus, an
analysis of management disclosure may not be as worthwhile as before, even with the
latest data mining techniques. Alternative data sources – such as public employee
reviews – that are out of the control of management become more critical and may
contain more valuable information compared to the traditional ones.
Job review websites (e.g., Glassdoor, Indeed, Vault) usually
require reviewers to leave anonymous comments about their firms. Since employers
are not allowed to delete or change employee reviews, this study assumes that
employee opinions from the job review websites reflect employees’ genuine thoughts
about their employers. These reviews can be a rich private information source about
employees’ working conditions and opinions (Teoh, 2018). As a new type of data,
they are used to understand and re-construct corporate culture (Corritore et al., 2020),
which is then linked to research relating to organizational misconduct (Aswani &
Fiordelisi, 2020; Cavicchini et al., 2021), firm performance (Guiso et al., 2015), and
financial reporting risk (Ji et al., 2017). Other studies treat employee reviews as a
source to provide incremental information for stock returns prediction (Green et al.,
2019), future corporate disclosure (Hales et al., 2018), corporate misconduct
prediction (Campbell & Shang, 2021), and financial misconduct risks evaluation
(Zhou & Makridis, 2019). The majority of existing research uses the readily available
firm ratings to proxy employee opinions (e.g., Campbell & Shang, 2021), with a few
exceptions performing a natural language processing (NLP) analysis of the firms’
employee reviews (e.g., Cavicchini et al., 2021).
However, online employee reviews have not yet been used to infer
the implementation of internal controls. While top management is responsible for
planning, implementing, and supervising internal controls, employees at all levels are the main players in the control process. The working environment described by online
employee reviews is also the control environment of a firm. Thus, these reviews may
contain information about the firm's compliance with control policies and procedures,
as well as control deficiencies in design or operations.
In summary, although existing studies have shown the benefits of
utilizing Big Data and data analytics for the early detection of internal control
material weakness, these prior works suffer from using only data disclosed by the
firms (e.g., annual reports, conference calls). As management may employ a counter-
strategy to these advanced data analytics techniques and disclose only self-serving
information, analyzing data from sources that are out of the control of management
becomes potentially more valuable. Online employee reviews thus become worthy of
investigation to infer the control practices in a firm.


CHAPTER 3
STUDY 1: FINANCIAL FRAUD DETECTION

3.1 The proposed fraud detection model

3.1.1 An introduction to Deep Forest and Cascade Forest


Ensemble methods in machine learning aim to leverage the principle
of "the wisdom of the crowd," where the collective opinions of a large group of
individuals can outperform a single expert. The proposed model in this study is
modified from Deep Forest (DF) proposed by Zhou and Feng (2019). By intelligently combining the results from a set of base learners (decision trees), DF effectively enhances accuracy, generalization, and robustness (Zhou, 2012). In the field of fraud
research, various ensemble learning techniques have demonstrated significant success
in real-world applications and have yielded impressive results (e.g., Bao et al., 2020;
Craja et al., 2020).
The core component of Deep Forest (DF) is the cascade forest (CF)
structure, whose architecture is inspired by the layer-by-layer processing in deep neural models. Similar to neural networks, CF consists of multiple layers, where each layer comprises multiple base classifiers that are ensembles of decision-tree forests. Users can switch between different types of forests to promote diversity.¹ Information is processed in a feed-forward manner. At each layer, input features are randomly assigned to the base classifiers. The first layer handles only the original features, while the inputs of each subsequent layer consist of the original features and the processed ones from previous layers. K-fold cross-validation is employed within each base classifier to mitigate overfitting risks. Layers are added sequentially. After
adding a new layer, the algorithm assesses the performance gain, and the training
process terminates when no significant gain is observed. As a result, the number of
layers in the CF is automatic and can be self-adaptively adjusted to achieve optimal
performance.
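To make the cascade structure above concrete, the following minimal sketch builds the same layer-growing loop from scikit-learn forests. It is an illustration, not the deep-forest package itself: the layer cap, the tolerance, and the helper name grow_cascade are assumptions made here for exposition.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_predict

def grow_cascade(X, y, max_layers=5, tol=1e-4):
    """Grow cascade layers until out-of-fold accuracy stops improving."""
    layers, augmented, best = [], X, 0.0
    for _ in range(max_layers):
        forests = [RandomForestClassifier(n_estimators=100, random_state=0),
                   ExtraTreesClassifier(n_estimators=100, random_state=0)]
        # Out-of-fold class-probability vectors (5-fold CV) curb overfitting.
        probas = [cross_val_predict(f, augmented, y, cv=5, method="predict_proba")
                  for f in forests]
        score = np.mean(np.argmax(np.mean(probas, axis=0), axis=1) == y)
        if score <= best + tol:      # no significant gain: stop adding layers
            break
        best = score
        for f in forests:
            f.fit(augmented, y)      # refit on all data for later prediction
        layers.append(forests)
        # The next layer sees the original features plus this layer's outputs.
        augmented = np.hstack([X] + probas)
    return layers

X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)
print("layers grown:", len(grow_cascade(X, y)))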

¹ The default setting of CF has two estimators, and each estimator consists of one random forest (RandomForest) and one completely random forest (ExtraTrees).


3.1.2 Cost-sensitive Cascade Forest


Cost-sensitive learning is an approach in machine learning that
considers the differential costs associated with misclassifications in a classification problem. For financial fraud, misclassification costs are asymmetric: Type II errors (false negatives, failing to detect fraud) are much more costly than Type I errors (false positives, misclassifying non-fraud) (Hajek & Henriques, 2017). In this study, a cost-sensitive cascade forest (CSCF) with self-adapting depth based on recall (sensitivity) is proposed for fraud detection. CSCF can be considered a variant of CF: it has the same base classifiers as the original cascade forest but more heavily weights the minority (fraud) class to penalize its misclassification.
This study defines a cost matrix based on cost-proportionate weighting of training samples according to the class distribution (Zadrozny et al., 2003). This cost matrix determines the total classification cost that cost-sensitive learning seeks to minimize. Since each decision tree is generated from a bootstrap sample with a different class distribution, balanced weighting is applied to each bootstrap sample for every grown tree instead of to the entire training set.
To simplify the process, the class weights in the Cost-sensitive Cascade Forest (CSCF) are determined by inverse proportionality to their frequencies within each bootstrap sample.² In this specific scenario of a two-class fraud detection problem, the weight for class i in a subsample is calculated as follows:

w_i = N / (2 × N_i)    (1)

where w_i indicates the weight for class i; N is the total number of observations in a bootstrap sample; the number 2 is set because fraud detection is a binary problem; and N_i indicates the total number of observations of class i in that bootstrap sample.
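As a sanity check on Equation (1), the weight computation can be sketched as below. It mirrors scikit-learn's "balanced" class-weight heuristic; the helper name balanced_weights is chosen here for illustration.

import numpy as np

def balanced_weights(y_bootstrap):
    """Class weights w_i = N / (2 * N_i) for a binary bootstrap sample."""
    y = np.asarray(y_bootstrap)
    n = len(y)                                  # N: bootstrap sample size
    return {int(c): n / (2 * np.sum(y == c))    # N_i: count of class i
            for c in np.unique(y)}

# Example: one fraud case among ten observations.
print(balanced_weights([0] * 9 + [1]))          # {0: 0.56, 1: 5.0} (approx.)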

² An alternative approach following the idea of cost-sensitive learning is to change the class weight in the decision trees of the forests when calculating the “impurity” score of a selected split point. Bootstrap Class Weighting was employed as it gave better results. Both Class Weighting and Bootstrap Class Weighting can be achieved by setting the class_weight argument to the corresponding value in each base classifier (forest) in Python.


Another modification in CSCF is the adjustment of the depth-terminating criterion of the original Cascade Forest (CF). Instead of using accuracy as the criterion, CSCF utilizes recall (sensitivity) to determine the terminating depth during training. This modification is motivated by the variation in misclassification costs, which justifies the emphasis on achieving higher recall in fraud detection studies (Craja et al., 2020). By redefining the stopping criterion for layer growth, the CSCF further improves its performance. The growth of the cascade forest is halted when the additional layer brings no increase in the recall score. This approach ensures that the CSCF maximizes its ability to correctly identify relevant instances, taking into account the significant consequences of missing fraudulent cases.
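The recall-based stopping rule might be sketched as follows (a hypothetical helper, not the deep-forest internals); in the grow_cascade sketch above, this test would replace the accuracy comparison.

from sklearn.metrics import recall_score

def should_stop(y_true, proba_prev, proba_new):
    """True when the newest layer does not improve fraud (class 1) recall."""
    prev = recall_score(y_true, proba_prev.argmax(axis=1), pos_label=1)
    new = recall_score(y_true, proba_new.argmax(axis=1), pos_label=1)
    return new <= prev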

3.2 The sample and data

3.2.1 Financial variables


Previous research in fraud detection has commonly utilized financial ratios as fraud indicators, but a recent study by Bao et al. (2020) demonstrated that using raw financial variables alone can be sufficient for detecting fraud. Building upon Bao's findings, this study follows a similar approach by exclusively using raw financial items and adopting the same variable selection procedure. The initial list of raw financial items used in Bao et al. (2020) combines findings from Cecchini et al. (2010a) and Dechow et al. (2011). These raw financial items are either theory-based or decomposed from expert-identified financial ratios. After eliminating the financial variables with more than 25% of their values missing, the resulting list consists of 28 raw financial variables. Descriptive statistics of the raw financial variables used in this study are shown in Table 3.1 below.


Table 3.1: Descriptive statistics of selected financial variables

3.2.2 Sample selection procedure


Consistent with the previous research from which the theory-motivated financial variables are derived, the financial fraud detection models in this study are tested specifically on publicly traded U.S. firms. Financial data were obtained from Compustat.³ Financial firms⁴ were excluded due to their unique accounting standards
and SEC disclosure requirements. Only firms listed on major U.S. exchanges (NYSE,
AMEX, NASDAQ, OTC) trading in U.S. dollars were included. This study follows
the sample selection procedure of Bao et al. (2020), except for the missing value
treatments. Details of the missing value treatment will be introduced in the next
subsection.
While their model used decision trees that rely on non-missing values and surrogate splits for missing data, assuming missing data has little predictive value, this study imputes missing values to balance the use of available data against the introduction of noise. Experiments were done in Python, and the sample selection process is shown in Table 3.2.

³ https://www.spglobal.com/marketintelligence/en/?product=compustat-research-insight
⁴ This study defines financial firms as the Financial Services sector in Compustat, which includes banks, insurance companies, broker/dealers, real estate, and other financial services.

3.2.3 Missing value treatments


Entire firm observations are dropped under two scenarios: (1) firm observations with missing CIK, sales, or total asset values; and (2) firm observations with over 25% of variables’ values missing. This study differs from Bao et al. (2020) in data pre-processing, especially in the treatment of missing values. The model in Bao et al. (2020) used Matlab and RUSBoost with decision trees relying on non-missing values and surrogate splits for missing data, assuming the missing data has no predictive value. However, ignoring observations with missing data may lead to suboptimal results. A heuristic cutoff of 25% is employed to utilize the majority of the information in the data.⁵
For the rest of the data, missing financial data was handled using complete case analysis and five imputation methods: zero, mean, modified mean, KNN, and RF imputation. The zero, mean, and modified mean methods are statistical, while KNN and RF are machine learning-based. Global mean and zero imputation are common but may introduce bias here given the variables’ large standard deviations (Allison, 2001). To alleviate this issue, this study introduces the modified mean, which applies mean imputation to variables that are only partially missing and recodes a variable to zero when all of its values in the subset are empty.
For machine learning approaches to missing data imputation, K-Nearest Neighbor (KNN) (Duda & Hart, 1973) and Random Forest (RF) (Liaw & Wiener, 2002) are selected to represent the clustering and tree-based approaches, respectively (e.g., Purwar & Singh, 2015). The sample selection and missing value treatment are presented in Table 3.2.
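These treatments can be approximated with scikit-learn as in the sketch below. "RF imputation" is emulated here with IterativeImputer wrapping a random-forest regressor, and the modified mean is omitted for brevity, so the exact implementations used in this study may differ.

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.ensemble import RandomForestRegressor

X = np.array([[1.0, np.nan], [2.0, 4.0], [np.nan, 6.0]])

imputers = {
    "zero": SimpleImputer(strategy="constant", fill_value=0.0),
    "mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=2),
    "rf": IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=50, random_state=0),
        random_state=0),
}
for name, imp in imputers.items():
    print(name, np.round(imp.fit_transform(X), 2).tolist())

# Complete case analysis simply drops any row containing a missing value.
print("complete-case", X[~np.isnan(X).any(axis=1)].tolist())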

⁵ A more aggressive cutoff line for deleting observations with missing values is 50% in the work of Craja et al. (2020).


Table 3.2: Sample selection

3.2.4 Labeling
This study relies on the disclosure of Accounting and Auditing
Enforcement Releases (AAERs) by the U.S. Securities and Exchange Commission
(SEC) to identify fraudulent financial reports for the following reasons: First, AAERs
have been widely used in research that requires highly reliable fraud identification, such as studies identifying fraudulent firms (e.g., Craja et al., 2020; Fanning & Cogger, 1998; Feroz et al., 2000; Humpherys et al., 2011; Purda & Skillicorn, 2015) and misconduct research (Karpoff et al., 2017). The SEC applies
enforcement actions to significant fraudulent activities supported by substantial and
robust evidence. Thus, the AAER database is considered a trustworthy and consistent
source for accurately identifying fraudulent firms. Second, AAERs provide
information about enforcement actions taken against auditors, companies, or
individuals for violations of SEC and federal rules. These releases contain details
about the entities involved, the nature of misconduct, and the impact of such
misconduct on financial statements. Thus, choosing AAERs as the fraud proxy opens more opportunities for future research.
In this study, the AAER database provided by the USC Marshall
School of Business is used to label fraudulent firms.⁶ All firm observations not labeled as fraudulent in the database are considered non-fraudulent firms. Since the number of labeled fraud firms in the AAER database is very limited before 2015 (fewer than 5), the year 2014 is chosen as the cutoff for testing the proposed models. The distribution of fraud firms can be found in Table 3.3, providing an overview of the fraudulent cases within the dataset.

⁶ The USC Marshall School of Business provides the AAER database compiled by Dechow et al. (2011).
Table 3.3: Distribution of fraud firms over 1991-2014

3.3 Experimental results

Section 3.3.1 describes the sample period and parameter settings for
the main experiments. Section 3.3.2 discusses the performance evaluation metrics,
and Section 3.3.3 presents experimental results.

3.3.1 Experimental setup


3.3.1.1 The sample period
The training period in this study spans at least 10 years from 1991, with a two-year gap before out-of-sample testing. The two-year gap is selected based on the work of Dyck et al. (2010), which found that it typically takes around two years for fraud to be disclosed. From Table 3.3, the number of fraud cases
reported in AAER has shown a noticeable decline since 2001. However, Donelson et
al. (2021) present different findings (Table 1, Panel C) regarding the number of
financial reporting fraud cases identified based on private enforcement during the
same period. This suggests a high possibility of false negatives in labeling fraud,
indicating that the percentage of undetected fraud may have increased since 2001.
Considering the decline in regulators' enforcement of accounting fraud since the 2008
financial crisis, this study ends the test period in 2008 to align with Bao et al. (2020)
for comparison purposes. It is important to note that due to undetected fraud, the
optimal test period for performance evaluation is considered to be from 2003 to 2005.
Therefore, this study presents average results for both the 2003-2005 and 2003-2008
periods to assess the robustness of the prediction models.
3.3.1.2 Parameter settings
The experiments in this study utilized two specific classifiers⁷: the "RUSBoostClassifier" from the imbalanced-learn Python toolbox and the "CascadeForestClassifier" from the Deep Forest Python package.⁸ Grid search was employed to find the optimal settings for the two hyperparameters required by the CascadeForestClassifier: the number of forests (the number of estimators in each cascade layer) and the number of trees in each forest (the number of trees in each estimator). The resulting number of estimators is set to 2, and the number of trees in each estimator is 100. To achieve bootstrap class weighting for cost-sensitive learning, the class_weight argument was set to 'balanced_subsample' in each estimator. 5-fold cross-validation was applied to each of the four base classifiers to alleviate the overfitting issue. The experimental workflow is summarized in Figure 3.1.
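A minimal sketch of this setup is shown below, assuming the interface documented for the deep-forest package; argument names should be verified against the package version used.

from deepforest import CascadeForestClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# Two class-weighted forests per cascade layer, 100 trees each.
estimators = [
    RandomForestClassifier(n_estimators=100, random_state=0,
                           class_weight="balanced_subsample"),
    ExtraTreesClassifier(n_estimators=100, random_state=0,
                         class_weight="balanced_subsample"),
]
model = CascadeForestClassifier(random_state=0)
model.set_estimator(estimators, n_splits=5)   # 5-fold CV per estimator
# model.fit(X_train, y_train)
# fraud_scores = model.predict_proba(X_test)[:, 1]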

⁷ Note that the data preprocessing does not include variable scaling (e.g., normalization). Both RUSBoost and the proposed model are decision-tree-based ensemble learning algorithms. As the learning of a decision tree is rule-based, neither algorithm is sensitive to variable scaling. Firm observations were fed to the models after preprocessing as described in Section 3.2.
⁸ Documentation of the Python package “Deep Forest”: https://deep-forest.readthedocs.io/en/latest/index.html


3.3.2 Evaluating metrics


The performance metrics used in this study are: Area Under the Curve (AUC) (Bradley, 1997), Normalized Discounted Cumulative Gain (NDCG) (Järvelin & Kekäläinen, 2002), recall at the top 1%, precision at the top 1%, and F2 at the top 1%. AUC serves as the primary evaluation metric. It is calculated by measuring the area under the Receiver Operating Characteristic (ROC) curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR). A higher AUC indicates a better ability of the classifier to distinguish between fraudulent and non-fraudulent firms. AUC is commonly used in the fraud detection literature due to its effectiveness in capturing model performance. In the case of imbalanced class distribution, such as in fraud detection, AUC is preferred over accuracy, since accuracy can be high even with a naive guess of the majority class. AUC therefore provides a more reliable measure of classification performance on imbalanced datasets.


In addition to evaluating fraud detection models based on classification performance, it is also important to assess their performance in terms of
ranking the predicted likelihood of fraud. Since investigating fraud can be costly, it is
more practical to present a shortened list of suspicious financial statements from the
top decile of risk, determined by the probability of fraud. To measure the model's
prediction performance at the top of the probability distribution, metrics commonly
used in information retrieval literature are employed, including Normalized
Discounted Cumulative Gain (NDCG), recall@k, precision@k, and F2-score@k. The
threshold for evaluating the model's performance is set at the top 1% of the
probability scores, as the average proportion of fraud in the observed years is below
1%. By focusing on the top 1% of the probability distribution, the evaluation captures
the model's ability to identify the most suspicious cases of potential fraud.
NDCG (Normalized Discounted Cumulative Gain) is a widely used metric for ranking problems.⁹ It assigns higher values when true positive (fraudulent)
observations are ranked higher than negative observations, thus capturing the model's
ability to prioritize true fraud cases. The F-score is a performance metric calculated as
the harmonic mean of precision and recall. Precision measures the accuracy of
positive predictions, while recall indicates the classifier's ability to correctly classify
relevant results. The F-score provides a balanced evaluation of classifiers, considering
both precision and recall, and reflects their ability to accurately identify fraudulent
observations. In the context of fraud detection, the cost of a type II error (failing to
identify fraudulent statements) is typically considered to be higher than the cost of a
type I error (misclassifying non-fraudulent statements). This cost asymmetry is
highlighted in Hajek and Henriques (2017). To account for this, an F2-score is
employed, which places more weight on recall than precision. This metric provides a
performance measure that considers the trade-off between precision and recall,
emphasizing the classifier's ability to detect fraud while maintaining a reasonable
level of precision. The results of the evaluation will include recall, precision, and F2-
score, using the top 1% of firms with the highest fraud probabilities as the second set of measurement metrics. This evaluation provides insights into the classifier's performance in identifying fraud among the most suspicious cases.

⁹ Bao et al. (2020) argue that the setting of the threshold k can be important for NDCG. This study performed a sensitivity analysis of NDCG's threshold setting at 0.5%, 1%, 1.5%, and 5%. Experimental results show that a higher threshold does not always lead to a higher NDCG score, suggesting the NDCG score is insensitive to the setting of k in these experiments.
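For concreteness, these metrics can be computed as in the following sketch, which uses scikit-learn for AUC and NDCG and codes the top-1% cutoff and the F2 combination directly; the helper name evaluate is illustrative.

import numpy as np
from sklearn.metrics import roc_auc_score, ndcg_score

def evaluate(y_true, y_score, top_frac=0.01):
    """AUC plus ranking metrics computed at the top 1% of risk scores."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    k = max(1, int(len(y_true) * top_frac))
    top = np.argsort(y_score)[::-1][:k]       # indices of the top-1% scores
    tp = y_true[top].sum()
    recall_k = tp / max(y_true.sum(), 1)
    precision_k = tp / k
    # F2 weights recall twice as heavily as precision (beta = 2).
    f2 = 5 * precision_k * recall_k / max(4 * precision_k + recall_k, 1e-12)
    return {"AUC": roc_auc_score(y_true, y_score),
            "NDCG@k": ndcg_score([y_true], [y_score], k=k),
            "recall@k": recall_k, "precision@k": precision_k, "F2@k": f2}

print(evaluate([0] * 95 + [1] * 5, np.linspace(0, 1, 100)))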

3.3.3 Experimental results and analysis


Table 3.4 presents the summarized results of the main experiments
conducted in this study. In Panel A, the experimental outcomes for the complete case
analysis deployed on RUSBoost and CSCF are displayed. Panels B-F provide a
comparison of different missing data imputation approaches. The results clearly
indicate that CSCF outperforms RUSBoost significantly, particularly in terms of the
primary performance metrics AUC and NDCG, both for individual predicted years
and aggregated test periods. This suggests that CSCF exhibits superior classification
capabilities in distinguishing between fraudulent and non-fraudulent observations. A
noteworthy observation is that CSCF yields considerably fewer zero values in the
ranking performance metrics (NDCG, recall@k, precision@k, and F2@k) compared
to RUSBoost. A zero value in these metrics indicates that the top 1% of predictions do not provide any valuable information in that specific year. With fewer zero values in the testing period, CSCF appears more reliable than the baseline model, offering more consistent and informative predictions. Overall, the findings strongly support the superiority of CSCF over RUSBoost, both in terms of classification performance and ranking capacity for fraud detection purposes.
Comparing different missing data imputation methods, the lowest AUC scores for both models in the aggregated testing periods of 2003-2005 and 2003-2008 are obtained when using the complete case analysis approach (Panel A). This suggests that
complete case analysis diminishes the ability of RUSBoost and CSCF to effectively
differentiate fraudulent reports from non-fraudulent ones. The ranking quality metrics
(primarily NDCG) lead to a similar conclusion: the highest NDCG scores for both models are achieved when missing data is handled through imputation, rather than using the complete case analysis approach. This finding highlights the importance of considering imputation methods for missing data in future research. Prior studies (e.g., Cecchini et al., 2010a; Craja et al., 2020; Hajek & Henriques, 2017) that employed complete case analysis to handle missing values might easily improve their prediction performance by deploying an alternative imputation strategy for missing values.


Table 3.4: Experimental results


The impact of missing value treatments differs between RUSBoost and CSCF. For RUSBoost, the optimal AUC, NDCG, and other performance metrics
(recall@k, precision@k, and F2@k) are achieved when missing data is handled
through K-nearest neighbors (KNN) imputation, mean imputation, and complete case
analysis, respectively. On the other hand, CSCF performs best when employing zero
imputation for missing data across all selected evaluation metrics.
Notably, CSCF with zero imputation yields the highest performance
across all evaluation metrics. For instance, the average AUC of CSCF using zero
imputation for missing values in the 2003-2005 test period reaches 0.82, surpassing
the highest AUC obtained by RUSBoost with KNN imputation (0.71) as well as the
AUC reported by Bao et al. (2020) (0.753), where RUSBoost disregards missing
values during tree learning. The improvement in ranking performance with CSCF is
even more remarkable. While Bao et al. (2020) reported an NDCG of 0.085 using
RUSBoost for the 2003-2005 test period, CSCF with zero imputation achieves an
NDCG of 0.99 for the same period.


3.4 Supplement analysis

3.4.1 The size effect of the training sets


The size effect of the training sets relates to the effectiveness of models and missing data treatments in various domains, including motorway traffic flow forecasting (Chen et al., 2001), photovoltaic systems monitoring (Koubli et al., 2016), and breast cancer prognosis (Jerez et al., 2010). Since there is no consensus on the
optimal training set's size (the number of years used to train the models) for financial
statement fraud studies, this study explores the impact of this size effect by
progressively reducing the number of years' observations. The testing period is fixed
from 2003 to 2005 when the fraud proxies are most reliable. For instance, with an
initial training size of 11 years, the training period for the test year 2003 spans from
1991 to 2001, for the test year 2004 it spans from 1992 to 2002, and for the test year
2005 it spans from 1993 to 2003. Figure 3.2 illustrates the average AUC score of
RUSBoost and CSCF using different imputation methods during the testing period.
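The sketch below reproduces this rolling-window design; the two-year gap between the end of the training period and the test year is inferred from the example spans above rather than stated explicitly:

```python
# Minimal sketch of the rolling-window design: for each test year, train on the
# `train_size` years ending two years before it. The two-year gap is inferred
# from the example spans in the text (e.g., test year 2003, train 1991-2001).
def rolling_windows(test_years, train_size):
    windows = []
    for test_year in test_years:
        train_end = test_year - 2              # gap year before the test year
        train_start = train_end - train_size + 1
        windows.append((train_start, train_end, test_year))
    return windows

for start, end, test in rolling_windows([2003, 2004, 2005], train_size=11):
    print(f"train {start}-{end} -> test {test}")
# train 1991-2001 -> test 2003
# train 1992-2002 -> test 2004
# train 1993-2003 -> test 2005
```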
From Figure 3.2, one can observe no discernible trend in the AUC
scores across the seven progressive training sizes. In other words, the size effect of the
training set is slight in the present case. Rather, AUC values are more related to the
choice of model and imputation method employed. Consistent with the previous
findings, both RUSBoost and CSCF exhibit poor performance when missing values are
treated with complete case analysis. The AUC scores for AUCNoMiss and
AUCNoMiss_DF are consistently the lowest among the different training sizes for their
respective models. This disparity is more pronounced in the case of RUSBoost, where
AUCNoMiss performs significantly worse than the other imputation methods across
all training sizes.
Another finding consistent with those of the main experiments is
that the proposed cascade forest (CSCF) consistently outperforms RUSBoost in terms
of AUC across all the training sizes tested. Even the poorest performance of CSCF,
using complete case analysis, surpasses the best performance of RUSBoost. This
further highlights the superior and more robust performance of CSCF across different
training period lengths.

Fig 3.2 Effect of training size on models’ AUC with missing value treatments


3.4.2 Feature importance of cost-sensitive cascade forest


Since CSCF is a tree-based ensemble learning model, it can calculate
feature importance through its inherent structure and training process. Feature
importance is calculated based on the impact of each feature on the performance of
the ensemble model. By calculating feature importance, one can identify which
features contribute most to model performance and thus improve model
interpretability.
The proposed CSCF model comprises multiple layers with forests,
allowing for the calculation of overall feature importance. This is achieved by
averaging the feature importance scores across each forest and layer. Table 3.5
showcases the top-10 financial variables ranked by their importance scores in the
best-performing CSCF model (zero missing data imputation) from the main
experiments. In this study, each financial variable in Table 3.5 is linked to the
corresponding account category identified by Dechow et al. (2011) and mapped to its
frequency ranking as reported in Bao et al. (2020).
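A minimal sketch of this averaging follows, assuming the cascade is represented as a list of layers, each holding fitted scikit-learn forests, and that every forest's importance vector is aligned to the same original feature set (in a cascade, later layers may append class-vector features, which would need to be dropped before averaging):

```python
# Minimal sketch: overall feature importance as the average of the per-forest
# importances across every forest in every layer of the cascade. Assumes all
# importance vectors are aligned to the original financial variables.
import numpy as np

def overall_feature_importance(cascade_layers):
    per_forest = [forest.feature_importances_
                  for layer in cascade_layers
                  for forest in layer]
    return np.mean(per_forest, axis=0)  # one averaged score per financial variable
```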
Interestingly, there is a noteworthy overlap between the top-10
financial variables listed in Table 3.5 and the account categories most impacted by
misstatements. Specifically, the four most influential predictors of fraud, namely "sale
of common and preferred stock," "common shares outstanding," "price close, annual,
fiscal," and "common/ordinary equity, total," are all associated with the account
category "inc_exp_se." It is important to note that "inc_exp_se" encompasses a
mixture of affected accounts that defy classification as income, expense, or equity
accounts. This compelling evidence indicates that the proposed model demonstrates
the capability to identify crucial financial variables that are frequently influenced by
misstatements.

Table 3.5 Feature importance of the top-10 financial variables and the related accounts
affected by the detected frauds

Rank | Raw financial variables | Average feature importance score for the test period 2003-2008 | Related indicator variables in Dechow et al. (2011) | Rank (relative frequency) of the indicators affected by the detected frauds over the test period 2003-2008
1 | Sale of Common and Preferred Stock | 0.0566 | inc_exp_se | 1 (31.8%)
2 | Common Shares Outstanding | 0.0366 | inc_exp_se | 1 (31.8%)
3 | Price Close, Annual, Fiscal | 0.0359 | inc_exp_se | 1 (31.8%)
4 | Common/Ordinary Equity, Total | 0.0334 | inc_exp_se; res | 1 (31.8%); 5 (5.5%)
5 | Cash and Short-Term Investments | 0.0326 | asset | 3 (11.8%)
6 | Current Assets, Total | 0.0314 | asset | 3 (11.8%)
7 | Sales/Turnover (Net) | 0.0310 | rev | 2 (23.9%)
8 | Property, Plant, and Equipment, Total | 0.0308 | asset | 3 (11.8%)
9 | Receivables, Total | 0.0306 | rec | 4 (8.6%)
10 | Inventories, Total | 0.0306 | inv | 5 (5.5%)
The definitions of the indicator variables are shown in Appendix A.

3.4.3 The effect of more financial variables


Bao et al. (2020) posed an interesting question regarding whether
the scale of data matters to the performance of a fraud prediction model. Their work,
which applied RUSBoost to a complete set of raw financial variables (266 raw data
items), drew a negative answer to this question. As discussed earlier, RUSBoost in
Matlab addresses missing data using a decision-tree surrogate mechanism.
Consequently, the algorithm relies heavily on the available data for prediction,
introducing a bias towards the observed values, and this bias may be exaggerated
when the proportion of missing values is large. Numerous studies in the field of data
mining have demonstrated that statistical or supervised-learning imputation methods
outperform surrogate splits in tree-based learning, particularly when the training data
exhibit a high incidence of missing values (Saar-Tsechansky & Provost, 2007;
Twala, 2009).
To further investigate this question, the study follows the same
procedures as in the corresponding section of Bao et al. (2020), using both RUSBoost
and CSCF with different missing value treatments. Data are obtained from Compustat's
"Fundamentals Annual" database10. A list of 187 financial items is used as feature
inputs after excluding variables with more than 25% missing values. For simplicity, a
straightforward approach of replacing missing values with zeros was utilized. The
experimental results are presented in Table 3.6.
Table 3.6 provides empirical evidence supporting the concern that
the inherent splitting mechanism in Matlab's RUSBoost algorithm can produce
misleading outcomes. In Panel A, the results without any missing value treatment are
significantly poorer, casting doubt on the earlier conclusion that incorporating
additional financial variables does not enhance fraud prediction. When zero
imputation is applied in Panels B and C, both models demonstrate improved
performance, with the proposed CSCF consistently outperforming RUSBoost, in line
with the earlier findings.

Table 3.6: Experimental results using 187 variables

10
Fundamentals Annual provides a comprehensive set of annual fundamental data including 710 financial
statement items and 131 miscellaneous and supplemental items.


Panel A: RUSBoost without missing value treatment (Baseline model)


AUC NDCG Recall@k Precision@k F2@k
2003 0.56 0.05 3.75% 5.26% 0.04
2004 0.67 0.05 4.84% 5.36% 0.05
2005 0.69 0.10 8.51% 7.41% 0.08
2006 0.65 0.04 6.25% 3.85% 0.06
2007 0.57 - - - -
2008 0.59 - - - -

2003-2005 0.64 0.06 5.70% 6.01% 0.06


2003-2008 0.62 0.04 3.89% 3.65% 0.04

Panel B: RUSBoost with zero imputation (Baseline model)


AUC NDCG Recall@k Precision@k F2@k
2003 0.71 0.15 6.25% 5.26% 0.06
2004 0.75 0.10 9.68% 11.54% 0.10
2005 0.70 0.10 4.26% 3.70% 0.04
2006 0.76 0.02 3.13% 1.92% 0.03
2007 0.68 0.05 6.45% 4.00% 0.06
2008 0.69 0.03 4.76% 2.13% 0.04

2003-2005 0.73 0.12 3.98% 2.28% 0.07


2003-2008 0.72 0.07 5.75% 4.76% 0.05

Panel C: CSCF with zero imputation (The proposed model)


AUC NDCG Recall@k Precision@k F2@k
2003 0.83 0.41 10.00% 61.54% 0.12
2004 0.84 0.39 12.90% 61.54% 0.15
2005 0.84 0.38 14.89% 41.18% 0.17
2006 0.82 0.24 9.38% 21.43% 0.11
2007 0.78 0.23 3.23% 12.50% 0.04
2008 0.73 0.37 9.52% 40.00% 0.11

2003-2005 0.83 0.39 12.60% 54.75% 0.15


2003-2008 0.81 0.34 9.99% 39.70% 0.12

Feature selection is important for cost-effective execution in
practice. This study ranks the importance scores of the 187 financial variables and
tests the effect of using only the most important features. Figure 3.3 reports the
average AUC over the 2003-2005 testing period and the average computational time
per layer11 for subgroups using the top 28, 50, 100, and 150 most important features
as inputs for training.

11
The computational environment in this study is: Intel(R) Core(TM) i7-10700KF CPU; RAM 32GB; NVIDIA
GeForce RTX 3070.


Fig 3.3 AUC vs Time per Layer (in seconds) for various feature set sizes

Figure 3.3 reinforces the notion in the data mining literature that
additional features come with computational costs, and that indiscriminate inclusion of
more features can introduce noise without necessarily improving AUC. Interestingly,
even with the same number of features, the AUC obtained from the main
experiments (0.82) is higher than the AUC achieved with a subset of the top
28 features (0.79), and is similar to the AUC obtained using the top 50 features.
Therefore, feature selection guided by theoretical insights is crucial for developing a
cost-effective fraud detection model.

3.5 Conclusion

This study proposes a new financial fraud detection tool by modifying a
recently proposed machine learning algorithm called Deep Forest or Cascade
Forest. The study distinguishes itself from previous benchmark works through the
use of a novel learning algorithm, alternative missing value treatments, and
comprehensive supplemental analyses. Samples with no more than 25% missing values
were retained for analysis. The remaining missing values were either treated using
complete case analysis or imputed using statistical approaches (zero, mean, and
modified mean imputation) and machine learning techniques (KNN and RF).


The experimental results consistently showed a remarkable enhancement
in the performance of the proposed cost-sensitive cascade forest (CSCF) fraud
detection model compared to the benchmark model introduced by Bao et al.
(2020). This significant improvement was observed across various test periods,
missing data treatments, and training sizes. Moreover, the findings highlight
the crucial role of imputation methods for missing value treatment in model
performance, as opposed to relying solely on complete case analysis.

3.5.1 Implications for Research


There are at least two important implications. Firstly, the study
demonstrates the effectiveness of cost-sensitive modeling in addressing the
classification challenges posed by imbalanced data, particularly in the context of fraud
detection where misclassification costs vary significantly between classes. The
proposed model and its results can serve as a valuable baseline for future studies in
cost-sensitive classification. Secondly, the research emphasizes the need for
researchers to carefully consider the impact of various missing data treatments on the
performance of fraud detection models. The treatment of missing data has been
largely overlooked in the fraud literature, and this study reveals that the choice of an
appropriate missing data treatment can significantly enhance the accuracy of financial
fraud detection. Therefore, it is recommended that future researchers explore and
conduct robustness tests on different missing data treatments to ensure the reliability
and robustness of their fraud detection models.

3.5.2 Implications for Practice


This study suggests addressing the challenge of missing values in
real-world fraud detection applications with the proposed cost-sensitive cascade forest
model, whose performance can be further enhanced with appropriate missing data
imputation. By analyzing feature importance, the study reveals that the most
significant predictors identified by the proposed model align closely with the account
categories most frequently impacted by misstatements. This suggests that the
proposed model can not only detect fraudulent activities but also identify specific
account categories that are more likely to be affected by misstatements. This
additional capability adds further value to the proposed model in assisting
stakeholders in their fraud detection efforts.

3.5.3 Limitations
The primary limitation of this study lies in the use of imperfect
fraud proxy measures, which pose challenges in accurately representing the
underlying construct of fraud. The performance of fraud detection models can vary
depending on the choice of fraud proxy, highlighting the need for further research in
developing more robust and comprehensive fraud proxies.
Another limitation of this study is the exclusive use of structured
data from public financial statements for the fraud detection model. It does not
consider the potential insights that may be derived from other forms of structured or
unstructured data, such as employee messages. Incorporating additional data sources
could provide valuable information for enhancing the prediction of fraud. Exploring
the integration of diverse data types in future research could lead to more
comprehensive and accurate fraud detection models.

3.5.4 Future Work


The simple inverse-proportions-based cost matrix used in this study
offers a starting point for cost-sensitive learning in
fraud detection. Further exploration and optimization of the cost matrix can
potentially lead to improved learning outcomes. Future researchers can investigate
different cost matrix designs and optimization techniques to enhance the effectiveness
of cost-sensitive learning in fraud detection.
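As a starting point for such exploration, the following is a minimal sketch of an inverse-proportions-based cost matrix; the function and variable names are illustrative, not the study's exact implementation:

```python
# Minimal sketch of an inverse-proportions-based cost matrix: the cost of
# misclassifying an example grows with the rarity of its true class.
import numpy as np

def inverse_proportion_cost_matrix(y):
    classes, counts = np.unique(y, return_counts=True)
    proportions = counts / counts.sum()
    cost = np.zeros((len(classes), len(classes)))
    for i in range(len(classes)):          # true class
        for j in range(len(classes)):      # predicted class
            if i != j:
                cost[i, j] = 1.0 / proportions[i]
    return cost

# With 1% fraud, misclassifying a fraud case costs 100, while misclassifying
# a non-fraud case costs roughly 1.01.
```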
The supplementary analysis conducted in this study emphasizes the
importance of theoretical guidance in the process of feature selection for designing
effective fraud detection models. The findings suggest that incorporating theoretical
insights and domain knowledge can lead to more reliable and robust feature selection,
ultimately improving the performance of fraud detection models. Future researchers
are encouraged to explore theory-motivated feature selection for more effective fraud
detection.


Given the significant class imbalance in fraud detection tasks, where
fraudulent instances typically account for less than 1% of the total, the utilization of
Generative Adversarial Networks (GANs) presents a promising approach to address
this challenge. By using GANs to generate synthetic instances of fraud, the training
dataset can be augmented with diverse and realistic fraud patterns. This augmentation
has the potential to enhance the performance of fraud detection models by enabling
them to learn from a more comprehensive and representative set of fraudulent
examples. The integration of GANs in fraud detection models can contribute to
improved fraud prediction capabilities in real-world applications.


CHAPTER 4
STUDY 2: ACCOUNTING CONTROL ISSUES IDENTIFICATION

Ineffective internal control can result in various negative consequences
for a company, its shareholders, and even the broader economy. On March 14, 2023,
Credit Suisse Bank disclosed in its annual report that it had identified "material
weaknesses" in internal controls over financial reporting. This breaking news
immediately triggered a global market panic, with substantial customer outflows and a
sharp decline in stock price. Within less than a week, the bank was acquired by its
rival UBS, Switzerland's largest bank, in a deal orchestrated by the Swiss government
to restore investor confidence and secure financial stability. The case of Credit Suisse
Bank is just one recent instance of internal control failure. In other cases, ineffective
internal control is often associated with accounting errors, fraud, non-compliance with
laws and regulations, and financial misstatements.
Information regarding the effectiveness of internal controls is important.
However, such information is extremely limited. The only regular information source
is the internal control reports contained in annual reports for compliance with the
Sarbanes-Oxley Act (SOX). The internal control reports provide an overview of the
internal control system's design and effectiveness in a firm, and report any material
weaknesses identified during the audit process. As investors rely on accurate and
reliable financial information to assess their investment's risk and potential returns,
early warning of material weaknesses is important for them to assess the quality of
financial reporting and evaluate the risks associated with their investment decisions.
For companies, early warning signs can enable them to identify and address control
weaknesses before they escalate into significant issues.
Online employee reviews are usually written by current and former
employees anonymously, allowing employees to share their experiences and opinions
without fear of retaliation from their employers. Manipulating online employee
reviews on a large scale is extremely difficult. Several recent studies have utilized
online employee reviews to capture unfiltered and candid opinions of current and
former employees regarding their employers, and to develop constructs such as
corporate culture (e.g., Corritore et al., 2020) and workplace environment (e.g., Li,
2022).
As every employee bears responsibilities within an internal control
system, this study proposes that analyzing employee reviews may provide a glimpse
of a firm's internal control practices, through which practitioners and researchers may
obtain an early signal of control problems for their decision-making.
Thus, the objective of Study 2 is to investigate the possibility of utilizing
online employee reviews as an alternative information source to indicate the internal
control practices in the companies. Upon exploring and confirming the potential of
using online employee reviews for the purpose of indicating internal controls (Section
4.1), an exclusive online employee reviews dataset is developed (Section 4.2), which
is then used to train the machine learning classifiers for identifying reviews with
mentions of control issues (Section 4.3). The current chapter is organized as follows:
First, Section 4.1 explores the online employee reviews collected and
verifies the hypothesis that online employee reviews contain information that can be
used to infer accounting control issues, which serves as a prerequisite for the subsequent
analysis. Section 4.2 describes the development of the online employee reviews
dataset, including how the data were collected, coded, cleaned, and verified. In
Section 4.3, the dataset created in Section 4.2 is used to train machine learning
classifiers to identify employee reviews with mentions of control issues. An in-depth
comparative analysis of the performance of a variety of machine learning models on
this task was conducted, based on which a web application was developed. The
introduction of the app architecture, functionalities, and examples of the webpage
layout can also be found in the same section. The last section (Section 4.4) discusses
the impact of the application and future work that can be undertaken following the
current study.

4.1 Data exploration

During the initial phase of data exploration, the main focus is to
validate the assumption that online employee reviews are informative for internal
control monitoring. If the assumption holds, firms with significant control
issues (e.g., reported material weakness) are expected to exhibit a higher prevalence
of control-related problems in their employee reviews compared to those with minor
control issues (e.g., without reported material weakness). This comparison of the
prevalence of control-related problems in online employee reviews between firms
with and without reported material weakness serves as a prerequisite for the
subsequent analysis in Study 2.
The Internal Control Integrated Framework (2013) issued by the
Committee of Sponsoring Organizations of the Treadway Commission ("the COSO
framework") is the most recognized guidance for internal control implementation.
The COSO framework was therefore used as the basis for coding the dataset.
The data exploration began by consulting the 17 accounting control
principles of the COSO framework and comparing the proportion of relevant
employee reviews from two groups of sampled companies. Only principles whose
implementation may be inferred directly from employees' self-reports of working
experiences were chosen. For example, an employee reporting an observation of
ethical violations (e.g., bribery) would relate to Principle 1, "integrity and ethical
values." A total of nine principles were identified. The topics were extracted from the
reviews and linked to the identified principles, as shown in Table 4.1.
Eighteen listed U.S. companies in different industries that had reported
internal control material weakness in the Audit Analytics Internal Control (AAIC)
database12 were randomly selected. In order to also have data on firms without
reported internal control material weaknesses, the 18 selected companies were
matched with 18 other firms that had the closest asset size in the same industry but did
not have material weakness reports in AAIC. Appendix B shows the sampled
companies. Employee reviews from before the year in which material weaknesses
were reported were gathered for the selected companies and their respective matches.
To ensure the protocols were written in alignment with COSO, the author drafted
protocols for each principle, which were later reviewed and revised by a full professor
of auditing.

12
https://www.auditanalytics.com/products/sec/internal-controls


Table 4.1: Prevalence of accounting control issues in a random sample of 200 employee reviews from a balanced set of 36 companies (18 companies with, and 18 companies without, internal control material weakness)

COSO component | Principle | Control issue | Set A: ICMW* (# of positive mentions /200) | Set B: non-ICMW** (# of positive mentions /200) | p-value (z test) | p-value (chi-squared test)
Control environment | Principle 1 | Integrity and ethical values | 11 | 7 | 0.335 | 0.124
Control environment | Principle 3 | Organizational structure, authority, and responsibility | 5 | 2 | 0.253 | 0.033
Control environment | Principle 4 | Destructive culture | 8 | 7 | 0.792 | 0.700
Control environment | Principle 4 (commitment to competence) | Competence (training) | 9 | 7 | 0.610 | 0.442
Control environment | Principle 4 (commitment to competence) | Performance appraisal | 5 | 1 | 0.100 | 0.000
Control environment | Principle 5 | Accountability | 10 | 14 | 0.399 | 0.268
Control environment | Principle 5 | Accountability (unrealistic goals) | 4 | 5 | 0.736 | 0.651
Risk assessment | Principle 6 | Specification of objectives | 7 | 5 | 0.558 | 0.365
Risk assessment | Principle 7 | Significant risk identification | 4 | 5 | 0.399 | 0.651
Control activities | Principle 11 | General control over technology | 4 | 0 | 0.044 | 0.000
Control activities | Principle 12 | Policies and procedures | 12 | 11 | 0.830 | 0.756
Information and communication | Principle 13-15 | Information and communication | 22 | 13 | 0.111 | 0.010
* ICMW represents the sample of the selected companies that reported internal control material weakness.
** Non-ICMW represents the sample of the matched companies, without reported internal control material weakness.


A sample of 200 employee reviews was constructed for each set of
companies (Set A: Companies with material weakness reports in AAIC, vs. Set B:
Companies without material weakness reports in AAIC). One of the authors coded the
reviews based on the protocols. Table 4.1 shows the results of the sampling, which
suggests the number of flagged reviews is significantly different between the two
groups of companies for COSO Principle 3 (Organizational structure, authority, and
responsibility), Principle 4 (Performance appraisal), Principle 11 (General control
over technology), and Principle 13-15 (Information and communication), using
Pearson's chi-squared test of significance. Hence, the coding for these principles was
prioritized, as Table 4.1 presents prima facie evidence that mentions of these specific
principles are higher for companies with vs without formal material weakness reports
in AAIC.

4.2 A dataset of online employee reviews for identifying accounting control issues

Figure 4.1 demonstrates the data collection workflow for the
dataset. Specifically, Section 4.1 (Data exploration and preparation) tackles the
important question of whether online employee reviews can serve as an alternative
data source for internal control monitoring. Section 4.2 provides a comprehensive
overview of the procedures for creating the exclusive online employee review
dataset, which encompasses collecting, coding, and cleansing the reviews.


Coding employee reviews can be a challenging task due to a variety
of issues. (1) Language complexity: employee reviews can exhibit a wide range of
language complexity, including technical terms, jargon, and idiomatic expressions.
This requires coders to have a strong command of the language and remain dedicated
to the task, particularly if the reviews contain slang or industry-specific jargon. (2)
Subjectivity and bias: employee reviews can be highly subjective and biased. Coders
must apply professional judgment to determine whether a review reflects a broader
control issue within the firm or is a more personalized, and potentially biased, account
of an individual's working experience. (3) Data completeness: in some cases,
employees may leave incomplete or vague feedback, and professional judgment is
again necessary to label such reviews appropriately. Hence, coders were carefully
chosen. All the coders, including the authority coders, have accounting education
backgrounds, with overseas education experience in English-speaking countries or
enrollment in an international program.
To guide coding efforts, three separate and coherent protocols were
developed: one protocol for Principle 3, one protocol for Principle 4, and one protocol
for Principles 11, 13-15. Three accounting graduate students with English as their
second language participated in the pilot test (simulating the coding process).
Modifications of the protocols were made to ensure the protocols were understandable
for later enrollees in the coding process. Appendices C, D, and E provide the resulting
three coding protocols.
A dataset of random reviews from the selected 18 companies with
internal control material weakness records in AAIC was created. Pamtag13, a web-
based annotation tool, was used for coding. A total of 117 senior students in a Triple Crown
Accredited accounting international program in Thailand participated in the coding at
the end of their Internal Control course. To improve the accuracy of coding, the
enrolled coders were randomly assigned into three groups, with each group focusing
only on a single, coherent protocol formulated from one of the three shortlisted
protocol sets: one protocol for Principle 3, one protocol for Principle 4 (performance
appraisal), and one protocol for Principles 11, 13-15 (Information and
communication).
Students were required to pass the pilot test with a passing score of
80% (20/25) before moving to the next phase. Each student was randomly assigned to
label at least 300 reviews according to their one assigned protocol. Each student was
assigned only to a single protocol to ensure their attention was focused only on a
single binary code (mention, or not, of a single Principle or single set of coherent
Principles.) Five students failed to complete the tasks, and the remaining student
coders passed the pilot tests and completed the tasks. The author and two full
professors in auditing served as the authority coders. Due to unsatisfactory inter-rater

13
https://pamtag.pamplin.vt.edu


reliability amongst undergraduate student coders on Principles 3, 4, and 1114, only the
undergraduate group coding for Principle 13-15 “Information and communication”
(34 coders) was kept, and the senior authority coders labeled the full sample of
reviews to detect mentions of Principle 315.
For Principle 3, "Organizational structure, authority, and
responsibility", each of the two authority coders (the author and a senior
collaborator16) labeled 2,000 reviews. Cohen's kappa (κ) (Cohen, 1960) was used to
evaluate inter-rater reliability. A κ of 0.650 between the two authority coders was
observed, suggesting substantial inter-rater agreement according to Landis and Koch
(1977). A full professor of auditing served as a third authority coder and labeled the
disagreements. The final labels represent the consensus of all three authority coders
after discussion. The coding process identified 194 reviews with mentions of Principle
3, "Organizational structure, authority, and responsibility".
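For reference, Cohen's kappa is computed from the observed agreement and the agreement expected by chance:

\[ \kappa = \frac{p_o - p_e}{1 - p_e} \]

where \(p_o\) is the proportion of reviews on which the two coders agree and \(p_e\) is the agreement expected by chance from each coder's marginal label frequencies; the observed \(\kappa = 0.650\) therefore indicates agreement well above chance.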
For Principles 13-15, "Information and communication", the labels
created by the student coders were carefully reviewed. Student coders whose total
number of positive mentions (of Principles 13-15) was two or more standard
deviations away from the mean number of positive mentions were considered rogue
coders. A total of six rogue coders were removed. Another two coders were
disqualified because they failed to follow the protocol, which required capturing
(copying and pasting) the specific paragraph(s) describing the identified control issue
mentioned in the review. After eliminating rogue and disqualified coders, twenty-six
(26) coders remained, with 8,195 reviews coded.
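A minimal sketch of this screening rule (function and variable names are illustrative):

```python
# Minimal sketch of the rogue-coder screen: flag coders whose total count of
# positive mentions lies two or more standard deviations from the mean count.
import numpy as np

def flag_rogue_coders(positive_counts):
    # positive_counts: dict mapping coder id -> number of positive mentions
    counts = np.array(list(positive_counts.values()), dtype=float)
    mean, std = counts.mean(), counts.std()
    return [coder for coder, c in positive_counts.items()
            if abs(c - mean) >= 2 * std]
```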
Since the labeling (coding) is randomly assigned in Pamtag, some
reviews were coded more than once. Following the examples of prior research
(e.g., Abrahams, Fan, Wang, Zhang, & Jiao, 2015; Abrahams, Jiao, Wang, & Fan,

14
Even though one group of student coders was assigned to code for mentions of Principles 11 and 13-15
(Information and communication), these two types of control issues were treated separately when calculating the
inter-rater reliability. Principle 11 was dropped because of unsatisfactory inter-rater reliability amongst
undergraduate student coders, for the same reasons as for Principles 3 and 4.
15
Each of the two authority coders coded 1000 reviews for Principle 4. However, only 66 unique positive records
were identified. Due to budget constraints, Principle 4 was dropped from the dataset.
16
The collaborator is an assistant professor in the field of auditing who received her Master's degree and PhD from
universities in the UK and Hong Kong, respectively.


2012; Goldberg & Abrahams, 2018), the labels were determined by the majority
vote for reviews coded more than twice. A conservative strategy is applied for the
reviews coded twice: that is, if one of the two coders of a single observation had
observed an accounting control issue, that coder’s label was taken as overriding the
coder who did not believe an accounting control issue was mentioned.
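A minimal sketch of these two resolution rules (names illustrative; codes are binary labels, with 1 indicating a mention of a control issue):

```python
# Minimal sketch of label resolution: majority vote when a review was coded
# more than twice; a conservative positive override when coded exactly twice.
def resolve_label(codes):
    if len(codes) == 2:
        return int(any(codes))                   # one positive code suffices
    return int(sum(codes) > len(codes) / 2)      # strict majority otherwise

assert resolve_label([0, 1]) == 1      # conservative override
assert resolve_label([1, 0, 0]) == 0   # majority vote
```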
Each of the two authority coders coded 1,300 reviews. The inter-
rater reliability between the two authority coders is κ = 0.633, indicating "substantial"
agreement according to Landis and Koch (1977). A third authority coder coded the reviews with
disagreements. All three authority coders discussed the disagreements, and a
consensus was reached to form the final labels for the records coded by authority
coders. Next, the records of positive mentions were merged with the ones coded by
student coders. The inter-rater reliability between authority and student coders is κ
=0.601, regarded as moderate agreement by Landis and Koch (1977). The resulting
dataset contains 669 identified reviews with mentions of Principle 13-15 “Information
and communication”.
The dataset can be accessed via a Mendeley Dataset Repository at:
https://data.mendeley.com/datasets/zw2shn7pv3/1

4.3 A web application for accounting control issues identification

Section 4.1 showed that online employee reviews could provide vital
insights into internal control practice in a firm, at least for certain, isolated principles
(specifically, Principles 3 and 13-15). However, determining whether a single
employee review suggests potential accounting control issues requires considerable
professional judgment, and the volume of contents can make identifying relevant
reviews difficult and tedious.
This study developed a web-based automated control issue detection tool
for users to identify (from a restricted list: Principles 3 and 13-15, again) online
employee reviews with mentions of relevant control issues. Users can upload their
collected reviews, select the type of control issues to identify, specify the predictive
models of their choice, and download the results. The web application will automate
the classification of the uploaded file and directly rank the reviews in descending
order by score or prediction confidence, where higher scores or confidence indicate a


higher likelihood of being of the target class. A decision-maker can employ their
professional judgment and begin with the highest-ranking reviews, which are most
likely relevant, and proceed to read them in descending order until they feel they have
exhausted the valuable reviews. The rest of the section will present the selection of
text classification models, describe and illustrate the web application’s architecture
and functionalities, and display the page layout.

4.3.1 Text classification methods


Identifying accounting control issues from online employee reviews
can be considered both a binary classification problem, where reviews are
classified into “with mentions” or “with no mentions” of control issues, and as a
ranking problem, where top-ranking reviews represent the most relevant ones to the
target class observations (reviews with mentions of accounting control issues). Thus,
this study trained the selected machine learning models to perform the binary
classification using the dataset developed in Section 4.1, and ranked the reviews based
on the prediction confidence. Popular machine learning algorithms chosen to perform
the task include logistic regression, decision tree, Support Vector Machine (SVM)
(Cortes & Vapnik, 1995), and random forest (RF) (Breiman, 2001). This study also
deployed four recent pre-trained word embedding models for the tasks: Word2Vec
(Mikolov et al., 2013), GloVe (Pennington et al., 2014), BERT (Devlin et al., 2018),
and GPT-2 (Radford et al., 2019). These models were employed as a word embedding
layer for the bi-directional Long Short-Term Memory (LSTM) model, a commonly
used machine learning model in binary text classification tasks. A grid search was
performed to tune the hyperparameters for all the models tested.
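A minimal PyTorch sketch of such a biLSTM classifier with a frozen pre-trained embedding layer follows; dimensions and names are illustrative, and for the contextual models (BERT and GPT-2) the static embedding layer would instead be replaced by the transformer's token representations:

```python
# Minimal sketch: a bi-directional LSTM binary classifier over a frozen
# pre-trained embedding matrix (e.g., Word2Vec or GloVe vectors).
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, pretrained_vectors, hidden_size=64):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)
        self.lstm = nn.LSTM(pretrained_vectors.size(1), hidden_size,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_size, 1)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)          # (batch, seq_len, emb_dim)
        _, (hidden, _) = self.lstm(embedded)          # final hidden states
        final = torch.cat([hidden[-2], hidden[-1]], dim=1)  # forward + backward
        return self.classifier(final)                 # logit; sigmoid gives probability
```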
However, while these models might deliver desired performance,
their results can be difficult to interpret due to the limited information about how the
models arrive at their predictions or decisions. Low interpretability can make it
challenging for users to trust the results of machine learning models or to identify
errors or biases in the models. To facilitate the interpretation, this study introduced the
"smoke term" analysis (Abrahams et al., 2012; Goldberg & Abrahams, 2018) as an
alternative method for the task. Smoke terms are words and phrases that are prevalent in a
target class of interest and infrequent otherwise. Compared to alternative machine
learning approaches, such as sentiment analysis and deep learning word embedding
models, the most substantial advantage of smoke terms is their interpretability. Smoke
terms can assist the decision-makers in understanding the model’s process with
emphasized prevalent words or phrases. Originating from research efforts to mine
product defect-related information from online consumer reviews (e.g., Abrahams et
al., 2012), the smoke-term technique has been applied to mine safety-concern-related
information from online consumer reviews (e.g., Abrahams et al., 2013; Abrahams et
al., 2012; Winkler et al., 2016), to detect service quality issues in hospitals (Zaman et al.,
2020), and to detect mortgage origination delays in financial services (Brahma et al.,
2021; Goldberg et al., 2022).
Fumeus, the smoke term analysis tool developed by Goldberg,
Gruss et al. (2022), is utilized to develop the web-based application. There are two
core functions in Fumeus: smoke term generation and smoke term scoring. The
"smoke term generation" function generates a weighted dictionary of smoke terms
from a given dataset, which can then be used to compute smoke term scores for
unseen records in a textual dataset. The "smoke term scoring" function returns a
file with records sorted by smoke term score in descending order, together with the
corresponding smoke term scores and a list of all smoke terms matched in each record.
Two hyper-parameters must be set in Fumeus: the length of the smoke terms
(unigrams, bigrams, or trigrams) and the information retrieval metric ("Correlation
Coefficient", "Robertson's Selection Value" (Robertson, 1986), "Relevance
Correlation Value" (Fan et al., 2005), or "Document and Relevance
Correlation" (Fan et al., 2005)) used to derive the smoke terms.
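The following is a generic sketch of these two functions, not the Fumeus API itself; the simple frequency-based weighting shown stands in for the information retrieval metrics named above:

```python
# Generic sketch (not the Fumeus API): build a weighted smoke-term dictionary
# from labeled reviews, then score unseen reviews by summing matched weights.
from collections import Counter

def generate_smoke_terms(positive_docs, negative_docs):
    pos = Counter(w for d in positive_docs for w in d.lower().split())
    neg = Counter(w for d in negative_docs for w in d.lower().split())
    # Illustrative weighting: target-class frequency discounted by frequency
    # elsewhere; Fumeus instead offers metrics such as Correlation Coefficient.
    return {w: c / (1 + neg[w]) for w, c in pos.items()}

def smoke_score(review, smoke_terms):
    return sum(smoke_terms.get(w, 0.0) for w in review.lower().split())

# Records are then sorted by smoke_score in descending order for inspection.
```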

Table 4.2 Top 20 unigrams retrieved by Correlation Coefficient score

Rank | Principle 3: Word | Weight | Principle 13-15: Word | Weight
1 | structure | 3938.95 | communication | 14188.26
2 | disorganized | 2292.67 | between | 5377.50
3 | organization | 2257.26 | communicate | 3876.77
4 | unorganized | 2226.08 | information | 3876.77
5 | org | 1988.82 | lack | 3763.15
6 | approval | 1720.43 | listen | 3296.28
7 | responsibilities | 1720.43 | communications | 3072.63
8 | confusion | 1569.64 | disconnect | 2989.57
9 | dysfunctional | 1569.64 | disconnected | 2816.51
10 | approvals | 1569.64 | departments | 2570.86
11 | reports | 1520.09 | feedback | 2264.89
12 | manager | 1512.55 | poor | 2119.55
13 | matrix | 1403.14 | cross | 1985.71
14 | description | 1403.14 | miscommunication | 1985.71
15 | silo | 1403.14 | listening | 1985.71

Table 4.2 presents the top 20 unigrams identified by the Correlation
Coefficient score, the metric that retrieved the most true positive records during the
hyperparameter grid search. Many of the words appear intuitively predictive. Recall
that a review is considered to mention a potential accounting control issue relating to
Principle 3 when it indicates problems with whether "the decisions and actions of
different parts of the organization are coordinated in a way that helps the department
to meet its objectives. Authority and responsibility are clearly defined and consistent
with the division/department's objectives, so the right people can make decisions and
take actions." The terms "structure", "disorganized", "unorganized", "dysfunctional",
and "silo" appear to convey employees' perceptions of whether the firms are
effectively structured and organized to achieve their objectives. Other terms, such as
"approval" and "matrix", seem to refer to a firm's authority or approval matrix, and
the terms "responsibilities" and "description" intuitively indicate reviews containing
comments on job descriptions and the responsibilities assigned to employees. The
term "confusion" might be mentioned by employees when authority and
responsibilities are not clearly defined.
In the top 20 unigrams for Principles 13-15, almost all terms seem to
indicate reviews describing how employees perceive information flow and
communication, whether upstream, downstream, cross-functional, or external, in the
firms17. As illustrated above, the smoke terms clearly explain and communicate the
retrieval rationale of the method.

17
The protocol for labeling the reviews with mentions of Principle 13-15 is "Sufficient and relevant information is
identified, captured and provided to the right people in a timely manner, enabling them to take action to meet
objectives and mitigate risks. Open lines of communication, upstream, downstream, cross functionally, and
externally, promote understanding and acceptance of values and assist the division/department in meeting their
objectives", as presented in Appendix E.

4.3.2 Evaluation metrics


Standard classification performance metrics, including AUC, recall,
precision, and accuracy, are used to evaluate the classification performance of the
selected models. In addition, this study uses the MARS metrics (MARS ShineThrough
and MARS Occlusion scores) (Mali et al., 2022; Restrepo et al., 2022) as performance
metrics to evaluate a classifier's prediction exclusivity relative to other classifiers. The
MARS ShineThrough score reflects a classifier's capacity to identify true positives
that are unique and not detected by other classifiers, while the MARS Occlusion score
reflects a classifier's tendency to miss, as false negatives, instances that other
classifiers correctly detect as true positives. The MARS metrics can be computed for
either an individual model or a combination of two models. In the case of combined
models, the score is determined by combining the predictions of the two models,
prioritizing correct class labels whenever possible. The performance of combined
models can be evaluated by comparing the MARS scores for individual classifiers
with the combined MARS score for the two classifiers. In this study, the combined
MARS metrics are used to find the best combination of two models for identifying
more target records (i.e., more accounting control issues).
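A minimal sketch of the two scores as described above (the exact definitions in Mali et al. (2022) and Restrepo et al. (2022) may differ in detail):

```python
# Minimal sketch of the MARS metrics as described in the text. `preds` maps
# classifier name -> set of record ids predicted positive; `positives` is the
# set of truly positive record ids. Assumes at least one true positive exists.
def mars_scores(preds, positives, target):
    tp = {name: p & positives for name, p in preds.items()}
    all_tp = set().union(*tp.values())
    others = set().union(*(s for name, s in tp.items() if name != target))
    shine = len(tp[target] - others) / len(all_tp)      # uniquely found by target
    occlusion = len(others - tp[target]) / len(all_tp)  # missed by target, caught elsewhere
    return shine, occlusion
```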
When the identification of control issues from online employee reviews
is treated as a ranking task, ranking quality can be assessed at a ranking cutoff after the
application sorts the reviews from most likely to least likely to belong to the target
class. A model that identifies more true positive reviews within the cutoff is preferred.
One way to evaluate ranking quality is to count the raw number of true positives
(mentions of control issues) in the top-ranked set of reviews. Another evaluation
metric widely used to measure the effectiveness of a ranked list of items in
information retrieval (IR) is normalized discounted cumulative gain (nDCG)
(Järvelin & Kekäläinen, 2002). nDCG takes into account both the relevance of the
items and their position in the list. It is particularly useful when the goal is to
maximize the number of relevant items in the top positions of the list. Hence, this
study uses the raw number of true positive examples among the top-ranked reviews
within the cutoff, together with nDCG, to assess the ranking quality of all machine
learning models, including smoke term analysis.

4.3.3 The selection of built-in prediction models


This study aims to develop a web application with built-in
prediction models that automatically rank online employee reviews to aid a decision-
maker in identifying records with mentions of control issues for closer analysis. These
models were selected based on their performance and interpretability. The assessment
of model performance focuses on their ability to accurately classify reviews into those
with mentions of control issues versus those without and their ability to rank the most
relevant reviews at the top of the list.
4.3.3.1 Evaluating classification performance
As previously outlined, the best-performing models for the
task of identifying online employee reviews with mentions of control issues are
chosen by comparing four popular machine learning algorithms - decision tree,
logistic regression, random forest, and SVM - along with four pre-trained word
embeddings - Word2Vec, GloVe, BERT, and GPT2. The models were trained and
tested on the online employee review datasets created in Section 4.1, and the
experimental results are presented in Table 4.3.
As shown in Table 4.3 Panel A, the results from analyzing
the sub-dataset for Principles 13-15 indicate that a biLSTM model with GPT2 as the
word embedding produces the best performance across all metrics. The AUC value of
0.95 indicates that the model ranks positive examples above negative examples in
95% of cases. Additionally, the model achieves high scores for recall (91%),
precision (93%), and accuracy (92%). For the Principle 3 dataset, as shown in Table 4.3
Panel B, the biLSTM model with BERT as the word embedding yields the best AUC
(0.82), precision (81%), and accuracy (76%). The model with Word2Vec as the word
embedding produces the highest recall (72%), indicating that it correctly identifies
72% of the true positive examples in the dataset.
Table 4.3 Experimental results
Panel A: Principles 13-15
Classifier | AUC | Recall | Precision | Accuracy
Word2Vec | 0.87 | 86% | 77% | 80%
GloVe | 0.90 | 86% | 84% | 85%
BERT | 0.73 | 42% | 79% | 65%
GPT2 | 0.95 | 91% | 93% | 92%
Logistic regression | 0.80 | 84% | 77% | 80%
SVM | 0.89 | 87% | 79% | 82%
Decision tree | 0.79 | 83% | 78% | 79%
Random forest | 0.90 | 87% | 81% | 83%

Panel B: Principle 3
Classifier | AUC | Recall | Precision | Accuracy
Word2Vec | 0.74 | 72% | 63% | 65%
GloVe | 0.71 | 67% | 67% | 67%
BERT | 0.82 | 67% | 81% | 76%
GPT2 | 0.69 | 23% | 60% | 53%
Logistic regression | 0.58 | 64% | 57% | 58%
SVM | 0.52 | 64% | 56% | 56%
Decision tree | 0.49 | 41% | 48% | 49%
Random forest | 0.58 | 38% | 60% | 56%
*Bold indicates the best-performing model (row heading) on the current metric (column
heading).
Given that different machine learning models utilize distinct
approaches for data analytics and pattern recognition, they may be capable of
identifying unique examples (observations) that are not detected by other models.
Thus, by leveraging the predictive outputs of multiple models, decision-makers may
benefit from combining their predictive outputs to identify additional target
observations that are exclusively recognized by specific models. The ShineThrough
Metric and Occlusion Metric, as proposed by Mali et al. (2022) and Restrepo et al.
(2022), are performance metrics that assess a model's ability to identify exclusive
observations. These metrics can be applied to evaluate both the individual and the
combination of two models. By comparing the metrics for individual models and the
combined model, decision-makers can observe the extent to which the combined
model detects extra true positives through the ShineThrough Metric and reduces false
negatives through the Occlusion Metric.
Tables 4.4 & 4.5 provide an overview of the MARS metrics
for Principle 13-15 and Principle 3, respectively. Table 4.4 Panel A reveals that the
difference in the ShineThrough scores between the individual and combined models is
mild, with the highest ShineThrough score of 0.02 in multiple combinations of two
models. Given that GPT2 offers the best recall score (suggesting GPT2 correctly
identified the highest number of true positive examples), the optimal combination of
two models to identify more true positive observations is that of GPT2 and GloVe.
However, the advantages of model combination are limited, as the combined
ShineThrough score is 0.02, signifying that a mere 2% of the total true positive
observations identified by all classifiers were exclusive to these two classifiers
(namely GPT2 and GloVe) and not detected by any other classifiers.
Meanwhile, as revealed in Table 4.4 Panel B, GPT2 offers
the lowest Occlusion score (0.08) among all individual models, suggesting that 8% of
the total true positive observations across all classifiers were incorrectly missed (false
negatives) by GPT2 but correctly recognized by at least one other classifier. The
combination of GPT2 and decision tree further decreased the Occlusion score from
0.08 to 0.03. Thus, the decision tree model is incorporated into the web application as
a supplementary classifier for users who wish to explore additional reviews after
reviewing all the records identified by the GPT2 model.


Table 4.4 MARS Metric for Principle 13-15


Panel A: ShineThrough Metric
y_axis \ x_axis | BERT | Decision Tree | GPT2 | GloVe | Random Forest | Logistic regression | SVM | Word2Vec
BERT - 0.0 | 0.01 0.0 | 0.0 0.0 | 0.01 0.0 | 0.0 0.0 | 0.0 0.0 | 0.0 0.0 | 0.01
Decision Tree 0.01 | 0.01 - 0.01 | 0.01 0.01 | 0.02 0.01 | 0.01 0.01 | 0.01 0.01 | 0.02 0.01 | 0.02
GPT2 0.0 | 0.0 0.0 | 0.01 - 0.0 | 0.02 0.0 | 0.0 0.0 | 0.0 0.0 | 0.0 0.0 | 0.01
GloVe 0.01 | 0.01 0.01 | 0.02 0.01 | 0.02 - 0.01 | 0.01 0.01 | 0.01 0.01 | 0.01 0.01 | 0.02
Random Forest 0.0 | 0.0 0.0 | 0.01 0.0 | 0.0 0.0 | 0.01 - 0.0 | 0.0 0.0 | 0.0 0.0 | 0.01
Logistic regression 0.0 | 0.0 0.0 | 0.01 0.0 | 0.0 0.0 | 0.01 0.0 | 0.0 - 0.0 | 0.0 0.0 | 0.01
SVM 0.0 | 0.0 0.0 | 0.02 0.0 | 0.0 0.0 | 0.01 0.0 | 0.0 0.0 | 0.0 - 0.0 | 0.01
Word2Vec 0.01 | 0.01 0.01 | 0.02 0.01 | 0.01 0.01 | 0.02 0.01 | 0.01 0.01 | 0.01 0.01 | 0.01 -
Panel B: Occlusion Metric
y_axis \ x_axis | BERT | Decision Tree | GPT2 | GloVe | Random Forest | Logistic regression | SVM | Word2Vec
BERT - 0.58 | 0.09 0.58 | 0.06 0.58 | 0.06 0.58 | 0.11 0.58 | 0.09 0.58 | 0.08 0.58 | 0.07
Decision Tree 0.16 | 0.09 - 0.16 | 0.03 0.16 | 0.04 0.16 | 0.08 0.16 | 0.07 0.16 | 0.06 0.16 | 0.04
GPT2 0.08 | 0.06 0.08 | 0.03 - 0.08 | 0.05 0.08 | 0.06 0.08 | 0.05 0.08 | 0.04 0.08 | 0.04
GloVe 0.13 | 0.06 0.13 | 0.04 0.13 | 0.05 - 0.13 | 0.05 0.13 | 0.05 0.13 | 0.05 0.13 | 0.06
Random Forest 0.16 | 0.11 0.16 | 0.08 0.16 | 0.06 0.16 | 0.05 - 0.16 | 0.1 0.16 | 0.08 0.16 | 0.06
Logistic regression 0.14 | 0.09 0.14 | 0.07 0.14 | 0.05 0.14 | 0.05 0.14 | 0.1 - 0.14 | 0.11 0.14 | 0.06
SVM 0.12 | 0.08 0.12 | 0.06 0.12 | 0.04 0.12 | 0.05 0.12 | 0.08 0.12 | 0.11 - 0.12 | 0.06
Word2Vec 0.13 | 0.07 0.13 | 0.04 0.13 | 0.04 0.13 | 0.06 0.13 | 0.06 0.13 | 0.06 0.13 | 0.06 -


Table 4.5 MARS Metric for Principle 3


Panel A: ShineThrough Metric
y_axis \ x_axis | BERT | Decision Tree | GPT2 | GloVe | Logistic regression | Random Forest | SVM | Word2Vec
BERT - 0.08 | 0.08 0.08 | 0.13 0.08 | 0.11 0.08 | 0.08 0.08 | 0.08 0.08 | 0.08 0.08 | 0.08
Decision Tree 0.0 | 0.08 - 0.0 | 0.05 0.0 | 0.03 0.0 | 0.0 0.0 | 0.0 0.0 | 0.03 0.0 | 0.0
GPT2 0.05 | 0.13 0.05 | 0.05 - 0.05 | 0.08 0.05 | 0.05 0.05 | 0.05 0.05 | 0.05 0.05 | 0.05
GloVe 0.03 | 0.11 0.03 | 0.03 0.03 | 0.08 - 0.03 | 0.03 0.03 | 0.03 0.03 | 0.03 0.03 | 0.11

Logistic regression 0.0 | 0.08 0.0 | 0.0 0.0 | 0.05 0.0 | 0.03 - 0.0 | 0.0 0.0 | 0.0 0.0 | 0.0

Random Forest 0.0 | 0.08 0.0 | 0.0 0.0 | 0.05 0.0 | 0.03 0.0 | 0.0 - 0.0 | 0.0 0.0 | 0.0

SVM 0.0 | 0.08 0.0 | 0.03 0.0 | 0.05 0.0 | 0.03 0.0 | 0.0 0.0 | 0.0 - 0.0 | 0.0
Word2Vec 0.0 | 0.08 0.0 | 0.0 0.0 | 0.05 0.0 | 0.11 0.0 | 0.0 0.0 | 0.0 0.0 | 0.0 -

Panel B: Occlusion Metric


y_axis \ x_axis | BERT | Decision Tree | GPT2 | GloVe | Logistic regression | Random Forest | SVM | Word2Vec
BERT - 0.32 | 0.24 0.32 | 0.26 0.32 | 0.13 0.32 | 0.18 0.32 | 0.24 0.32 | 0.16 0.32 | 0.11
Decision Tree 0.55 | 0.24 - 0.55 | 0.42 0.55 | 0.16 0.55 | 0.29 0.55 | 0.45 0.55 | 0.29 0.55 | 0.21
GPT2 0.76 | 0.26 0.76 | 0.42 - 0.76 | 0.21 0.76 | 0.24 0.76 | 0.42 0.76 | 0.21 0.76 | 0.18
GloVe 0.32 | 0.13 0.32 | 0.16 0.32 | 0.21 - 0.32 | 0.16 0.32 | 0.18 0.32 | 0.13 0.32 | 0.16

Logistic regression 0.34 | 0.18 0.34 | 0.29 0.34 | 0.24 0.34 | 0.16 - 0.34 | 0.34 0.34 | 0.32 0.34 | 0.18

Random Forest 0.55 | 0.24 0.55 | 0.45 0.55 | 0.42 0.55 | 0.18 0.55 | 0.34 - 0.55 | 0.34 0.55 | 0.26

SVM 0.34 | 0.16 0.34 | 0.29 0.34 | 0.21 0.34 | 0.13 0.34 | 0.32 0.34 | 0.34 - 0.34 | 0.18
Word2Vec 0.26 | 0.11 0.26 | 0.21 0.26 | 0.18 0.26 | 0.16 0.26 | 0.18 0.26 | 0.26 0.26 | 0.18 -


Table 4.5 reports the MARS metrics for Principle 3. Among
individual classifiers, BERT exhibited the highest ShineThrough score (0.08), while
Word2Vec offered the lowest Occlusion score (0.26). Further improvements in the
scores were observed upon combining two classifiers. Specifically, the combination
of BERT and GPT2 resulted in the highest ShineThrough score (0.13), and the pairing
of BERT with Word2Vec yielded the lowest Occlusion score (0.11). Since both
combinations involve BERT, and the extent of improvement in the MARS metrics is
greater for the Occlusion score, Word2Vec is provided in the web application as a
supplementary model for users who wish to explore additional records after reviewing
those identified by the BERT model.

4.3.3.2 Evaluating ranking quality
For the following two reasons, a random sample of 1,000 Glassdoor
reviews from a population of over 10,000 reviews was used to evaluate the models'
ranking quality, instead of the balanced online employee reviews dataset. First, the
sampling shown in Table 4.1 (Section 4.1) revealed a low percentage of reviews
related to accounting control issues; in particular, the proportion of reviews
mentioning control issues relating to Principle 3 is approximately 1%-3%. Evaluating
the ranking quality of models using the balanced dataset generated in Section 4.1
would therefore not be reflective of reality, so this study employed holdout testing
instead. Second, the dataset generated in Section 4.1 is relatively small; in particular,
the Principle 3 dataset contains only 195 positive labels. After excluding the
observations used for training, not enough observations remain for holdout testing to
measure the models' ranking quality effectively.
The same three authority coders coded the downloaded
reviews, following the protocols outlined in Section 4.1. For simplicity, this study
assessed the ranking quality for only the three best classification models and the
smoke term analysis for each control issue type. One of the advantages of smoke term
analysis is the interpretability of its outcomes. Previous studies have
demonstrated that smoke term analysis can deliver results competitive with advanced
deep learning techniques and has exhibited superior performance compared to
sentiment analysis in various ranking tasks (e.g., Goldberg & Abrahams, 2018;
Zaman et al., 2021).
The model's ranking quality was evaluated based on the top
50, 100, 150, and 200 ranked reviews, as per the pre-determined cutoff. This study
utilized the number of true positive reviews among the top-ranked reviews and normalized
discounted cumulative gain (nDCG) as the performance metrics to evaluate the
ranking quality. Table 4.6 presents the results at various pre-determined cutoffs for
each control issue type. As shown in Table 4.6 Panel A, we can observe that GPT2
offers the best performance regarding the number of true relevant reviews and nDCG
at different cutoff points, consistent with findings in Table 4.3 Panel A. These results
suggest that the GPT2 model identified the highest number of relevant reviews at the
top of the ranking list and effectively prioritized them for control issues relating to
Principle 13-15. Smoke term analysis (column: 'Unigram') followed closely as the
second-best performing model.
Table 4.6, Panel B provides an evaluation of the ranking
performance of the models in identifying relevant reviews related to Principle 3 across
varying cutoff thresholds. Notably, while BERT identified the highest number of
reviews with mentions of control issues relating to Principle 3 in the top 50 ranked
reviews, smoke term analysis outperformed the former in terms of the raw number of
true positives and nDCG at the top 100, 150, and 200 ranked reviews. Nevertheless,
the integration of Word2Vec with BERT resulted in a slightly higher number of true
positive records compared to the smoke term analysis, indicating the effectiveness of
the proposed combination strategy.


Table 4.6 Ranking quality evaluation: number of true positive reviews and normalized
discounted cumulative gain (nDCG) in top-ranked reviews

Panel A: Principles 13-15
Cutoff | Metric | GPT2 | GloVe | Random forest | GPT2+DT | Unigram
Top 50 | True positive | 45 | 29 | 31 | - | 41
Top 50 | nDCG | 0.93 | 0.62 | 0.63 | - | 0.87
Top 50 | Extra true positive (Total) | - | - | - | 4 (49) | -
Top 100 | True positive | 60 | 45 | 46 | - | 49
Top 100 | nDCG | 0.88 | 0.63 | 0.63 | - | 0.76
Top 100 | Extra true positive (Total) | - | - | - | 4 (64) | -
Top 150 | True positive | 61 | 54 | 55 | - | 53
Top 150 | nDCG | 0.89 | 0.72 | 0.72 | - | 0.79
Top 150 | Extra true positive (Total) | - | - | - | 3 (64) | -
Top 200 | True positive | 63 | 56 | 57 | - | 60
Top 200 | nDCG | 0.91 | 0.74 | 0.75 | - | 0.85
Top 200 | Extra true positive (Total) | - | - | - | 3 (66) | -

Panel B: Principle 3
Cutoff | Metric | BERT | Word2Vec | GloVe | BERT+W2V | Unigram
Top 50 | True positive | 11 | 7 | 9 | - | 10
Top 50 | nDCG | 0.32 | 0.12 | 0.24 | - | 0.28
Top 50 | Extra true positive (Total) | - | - | - | 4 (15) | -
Top 100 | True positive | 16 | 8 | 12 | - | 18
Top 100 | nDCG | 0.41 | 0.15 | 0.29 | - | 0.41
Top 100 | Extra true positive (Total) | - | - | - | 2 (18) | -
Top 150 | True positive | 18 | 14 | 14 | - | 22
Top 150 | nDCG | 0.44 | 0.25 | 0.33 | - | 0.48
Top 150 | Extra true positive (Total) | - | - | - | 5 (23) | -
Top 200 | True positive | 19 | 17 | 16 | - | 24
Top 200 | nDCG | 0.44 | 0.30 | 0.35 | - | 0.51
Top 200 | Extra true positive (Total) | - | - | - | 6 (25) | -


Fig. 4.2 Accumulated true positives in top-ranked reviews for Principle 13-15
[Line chart: x-axis = rank of reviews (1-200); y-axis = number of true positives;
series: GPT2, Random Forest, GloVe, Unigram, Baseline (random guess)]

Fig. 4.3 Accumulated true positives in top-ranked reviews for Principle 3
[Line chart: x-axis = rank of reviews (1-200); y-axis = number of true positives;
series: BERT, Word2Vec, GloVe, Unigram, Baseline (random guess)]

To demonstrate the effectiveness of the candidate models in identifying reviews
mentioning control issues, the cumulative number of true relevant reviews at the top
of the ranking list was plotted, as illustrated in Figure 4.2 (Principles 13-15) and
Figure 4.3 (Principle 3). A baseline was established under the assumption of random
guessing as to whether a review pertains to the corresponding control issue; this
baseline appears as the diagonal line in Figures 4.2 and 4.3.
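A minimal sketch of how such cumulative curves and the random-guess baseline can be generated is given below; the stand-in labels are simulated purely for illustration, whereas the actual figures use the hand-labeled relevance of each model's top-ranked reviews.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Stand-in data: model name -> 0/1 labels of its top 200 ranked reviews
# (1 = the review truly mentions the control issue).
model_rankings = {"Model A": rng.binomial(1, 0.30, 200),
                  "Model B": rng.binomial(1, 0.25, 200)}

for name, labels in model_rankings.items():
    plt.plot(np.arange(1, 201), np.cumsum(labels), label=name)

# Random-guess baseline: expected true positives grow linearly with rank,
# at the prevalence of relevant reviews in the scored sample (0.30 here).
plt.plot([1, 200], [0.3, 0.3 * 200], "k--", label="Baseline (random guess)")
plt.xlabel("Rank of reviews")
plt.ylabel("Number of true positives")
plt.legend()
plt.show()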
4.3.3.3 Summary
This subsection provides a comparative analysis of the
effectiveness of various machine learning models in identifying online employee
reviews with mentions of select control issues (specifically, those relating to Principle
3, and to Principle 13-15 of the COSO framework). The performance evaluation of
each model is based on its classification accuracy, ranking quality, and
interpretability. This study also explored the benefits of combining two classifiers for
users who wish to identify additional relevant reviews after they feel they have
exhausted the valuable records from individual classifiers. In particular, GPT2
achieved superior performance compared to other individual models in detecting
reviews related to Principle 13-15. A combination of GPT2 and decision tree appears
to be the most suitable choice for users who wish to identify more relevant reviews
relating to Principle 13-15 according to the MARS metrics. The smoke term analysis
offers the second-best performance in terms of ranking quality. More importantly, this
method facilitates communication of the model's inner workings to decision-makers
through emphasized smoke terms, thereby enhancing interpretability. As such, GPT2,
decision tree, and smoke term analysis were selected for deployment in the web
application development.
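The "extra true positive" idea behind such combinations, in the spirit of the MARS notion of unique target-class discoveries (Mali et al., 2022), can be sketched as follows; the set-valued inputs are a hypothetical representation of classifier outputs and gold labels, not the dissertation's implementation.

def extra_true_positives(primary_flags, secondary_flags, relevant):
    # Relevant reviews flagged by the secondary classifier but missed by
    # the primary one -- the "Extra true positive" rows in Table 4.6.
    return (secondary_flags - primary_flags) & relevant

# Hypothetical example: the decision tree recovers two relevant reviews
# that the GPT2 ranking missed.
gpt2_flags = {1, 2, 3, 4}
dtree_flags = {3, 4, 5, 6, 9}
gold_labels = {1, 3, 5, 9}
print(len(extra_true_positives(gpt2_flags, dtree_flags, gold_labels)))  # -> 2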
Similarly, three models, namely BERT, Word2Vec, and
smoke term analysis, were chosen to identify reviews with mentions of control issues
related to Principle 3. BERT displayed superior classification performance in general
and demonstrated the best ranking quality among the top 50 ranked reviews.
Word2Vec can serve as an effective supplementary model for users seeking to
combine its outputs with BERT's to identify additional relevant target records. The smoke
term analysis offered the best ranking quality at the cutoff points of the top 100, 150,
and 200. Considering performance and interpretability, smoke term analysis appears
to be the best model for the task.

4.3.4 Software description


4.3.4.1 Software architecture
The overall development aims to create a responsive web
application that helps users identify online employee reviews with control issues,
which can then be used for further investigation. The architecture of the web application
consists of a user interface and a backend that utilizes pre-trained machine learning
models for predictions, as depicted in Figure 4.4. In this work, Streamlit [18], an open-
source web framework based on Python, is used to create the web application.
Streamlit integrates seamlessly with popular data science frameworks such as PyTorch
and supports a diverse range of data science libraries such as Pandas, NumPy, NLTK,
and Transformers, making it a versatile and potent tool for data scientists and
developers. The models designed to detect reviews with mentions of each control issue
were pre-trained and are ready to be loaded for predictions through an application
programming interface (API) call. Streamlit is used to implement the responses to the
API calls initiated by the user interface.
Through the user interface, users can upload files, select the
type of control issue, and choose the relevant pre-trained model for predictions. Upon
selection, the application loads the pre-trained model, processes the data in the
backend, and displays the predictions for download by the users. The deployment of
the software is facilitated through Render's [19] web services, which provide a platform
for hosting and scaling web applications.

[18] https://www.streamlit.io/
[19] https://render.com/
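As a rough illustration of this architecture, a Streamlit application of the kind described might be sketched as follows. The scoring stub, option labels, and the assumption that the text sits in the first CSV column are illustrative only; the deployed application loads the actual pre-trained models instead.

# app.py -- minimal sketch, not the deployed code
import pandas as pd
import streamlit as st

st.title("Control Issues Identifier")
issue = st.selectbox("Control issue type", ["Principle 3", "Principle 13-15"])
model_name = st.selectbox("Prediction model", ["GPT2", "Decision tree", "Smoke terms"])
uploaded = st.file_uploader("Upload a CSV file of textual narratives", type="csv")

def score(texts):
    # Stand-in for loading the selected pre-trained model and scoring records.
    return [len(str(t)) for t in texts]  # placeholder scores only

if uploaded is not None:
    df = pd.read_csv(uploaded)
    df["score"] = score(df.iloc[:, 0])  # assumes text in the first column
    ranked = df.sort_values("score", ascending=False)
    st.dataframe(ranked.head(50))  # preview the top-ranked records
    st.download_button("Download predictions", ranked.to_csv(index=False),
                       file_name="predictions.csv")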

Figure 4.4 Web application architecture

4.3.4.2 Software functionalities

Figure 4.5 illustrates the workflow of the web application. The main
functionalities are:

Figure 4.5 Web application components

• Control issue type selection: The user can select one of the two control
issue types provided to detect potential control problems in online
employee reviews. The classification of control issue types is based on
COSO's 2013 internal control framework. (Figure 4.6)
• Machine learning model selection: The user can select the pre-trained
machine learning model for prediction. A description of each model and
its performance is provided. (Figure 4.7)
• File upload: The user can upload a CSV file of textual narratives. (Figure
4.8)
• Text analysis: Once the user uploads the file, text classification and
ranking are performed automatically.
• Outputs preview and download: The user can preview the outputs of the
text analysis and download them in CSV format. (Figure 4.8)

The web application can be accessed via
https://accounting-issues-identifier.onrender.com. The code repository is available at
https://gitlab.com/control_issues_identifier/Control_issues_identifier.

Figure 4.6 Main page layout for selecting control issue type

Figure 4.7 Layout for selecting a machine learning prediction model

Figure 4.8 Layout for file upload and prediction preview

4.4 Conclusion

Despite the significance of internal control effectiveness for the reliability
of financial reporting, corporate governance, and investor protection, Internal Control
Reports, as required by Section 404 of the Sarbanes-Oxley Act (SOX), are the sole
publicly accessible source of information about that effectiveness. To overcome this
information asymmetry, recent research has turned to advanced artificial intelligence
technologies to uncover control issues in firms (Nasir et al., 2021; Sun, 2018, 2019),
yet the potential of utilizing online employee reviews to infer accounting control issues
has previously remained largely unexplored.
Section 4.1 provides empirical evidence that online employee reviews
contain valuable information pertaining to a firm's control practices. Furthermore,
firms that have disclosed material weaknesses in their internal controls tend to receive
more reviews related to control issues than those without. However, given the
rarity and sparsity of mentions of control issues in the reviews, identifying such issues
can be tedious. This study leveraged advanced machine learning algorithms to
develop a web application that assists the review process for decision-makers. With
just a few simple clicks, the app streamlines the process by automatically prioritizing
(scoring and ranking) uploaded records, with those most likely containing mentions of
control issues appearing at the top of the list. Notably, besides ranking the records, the
built-in “smoke term analysis” model can generate a list of “smoke terms” (prevalent
words or phrases in the target class) for each record, providing decision-makers with
insight into the underlying rationale of the model's decision-making process. These
smoke terms can assist decision-makers in making a more informed decision during a
more in-depth analysis of the records.
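Although the exact term-weighting scheme is not reproduced here, a simple smoothed frequency-ratio variant conveys how such smoke terms can be scored; the sketch below is one of several weighting schemes used in the smoke-term literature, not necessarily this study's formula.

from collections import Counter

def smoke_scores(target_docs, other_docs, min_count=5):
    # Rank unigrams by how much more prevalent they are in reviews labeled
    # with the control issue (target class) than in the remaining reviews.
    tgt = Counter(w for d in target_docs for w in d.lower().split())
    oth = Counter(w for d in other_docs for w in d.lower().split())
    n_tgt, n_oth = sum(tgt.values()), sum(oth.values())
    ratios = {w: (c / n_tgt) / ((oth[w] + 1) / (n_oth + 1))
              for w, c in tgt.items() if c >= min_count}
    return sorted(ratios.items(), key=lambda kv: -kv[1])

# A record's smoke score can then be, for example, the sum of the weights
# of the smoke terms it contains, which drives the ranking.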
Given that the number of employee reviews on Glassdoor varies considerably
across firms, ranging from more than ten thousand to fewer than ten, the web-based
tool is particularly useful when the number of reviews is large. It can be used by
individual researchers, institutions, auditors, or other stakeholders who wish to obtain
supplementary information about the status of control implementation from the
perspective of employees, facilitating a more comprehensive evaluation of the
effectiveness of internal control.

However, the current study has a number of limitations. In Section 4.1, nine
types of control issues that may be inferred directly from employees' self-reports of
their working experiences were identified as valuable indicators for detecting internal
control deficiencies. However, owing to the cost and professional judgment required
for manual labeling, the employee reviews dataset created in Section 4.1 includes only
two types of control issues (specifically, those relating to Principle 3 and Principles
13-15). Thus, the web application is currently restricted to these two control issue
types. Future work can expand the scope of labeled data to enhance the practical utility
of the tool; the cost of manual labeling may be reduced with advanced generative AI
(e.g., ChatGPT). Nevertheless, the findings of the current work remain valuable as
indicators of internal control weakness. Future work can also extract textual features
from the employee reviews and use them as new variables to enhance the performance
of nascent detection models for internal control material weaknesses.

REFERENCES

Book and book articles

Allison, P. D. (2001). Missing data. Sage Publications.


Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis (Vol. 3).
Wiley New York.

Articles

Abrahams, A. S., Jiao, J., Fan, W., Wang, G. A., & Zhang, Z. (2013). What's buzzing
in the blizzard of buzz? Automotive component isolation in social media
postings. Decision Support Systems, 55(4), 871-882.
Abrahams, A. S., Jiao, J., Wang, G. A., & Fan, W. (2012). Vehicle defect discovery
from social media. Decision Support Systems, 54(1), 87-97.
ACFE. (2020). 2020 global study on occupational fraud and abuse.
https://acfepublic.s3-us-west-2.amazonaws.com/2020-Report-to-the-Nations.pdf
Alden, M. E., Bryan, D. M., Lessley, B. J., & Tripathy, A. (2012). Detection of
financial statement fraud using evolutionary algorithms. Journal of Emerging
Technologies in Accounting, 9(1), 71-94.
Ashbaugh-Skaife, H., Collins, D. W., & Kinney Jr, W. R. (2007). The discovery and
reporting of internal control deficiencies prior to SOX-mandated audits.
Journal of Accounting and Economics, 44(1-2), 166-192.
Aswani, J., & Fiordelisi, F. (2020). Tournament Culture and Corporate Misconduct:
Evidence using Machine Learning. Available at SSRN.
Bao, Y., Ke, B., Li, B., Yu, Y. J., & Zhang, J. (2020). Detecting accounting fraud in
publicly traded US firms using a machine learning approach. Journal of
Accounting Research, 58(1), 199-235.
Bardhan, I., Lin, S., & Wu, S.-L. (2015). The quality of internal control over financial
reporting in family firms. Accounting Horizons, 29(1), 41-60.

Batista, G. E., & Monard, M. C. (2003). An analysis of four missing data treatment
methods for supervised learning. Applied Artificial Intelligence, 17(5-6), 519-
533.
Beneish, M. D. (1997). Detecting GAAP violation: Implications for assessing
earnings management among firms with extreme financial performance.
Journal of Accounting and Public Policy, 16(3), 271-309.
Beneish, M. D. (1999). The detection of earnings manipulation. Financial Analysts
Journal, 55(5), 24-36.
Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of
machine learning algorithms. Pattern recognition, 30(7), 1145-1159.
Brahma, A., Goldberg, D. M., Zaman, N., & Aloiso, M. (2021). Automated
mortgage origination delay detection from textual conversations. Decision
Support Systems, 140.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32.
Calderon, T. G., Song, H., & Wang, L. (2016). Audit deficiencies related to internal
control: An analysis of PCAOB inspection reports. The CPA Journal, 86(2),
32.
Campbell, D., & Shang, R. (2021). Tone at the Bottom: Measuring Corporate
Misconduct Risk from the Text of Employee Reviews. Management Science,
68(9), 7034-7053.
Campbell, S., Li, Y., Yu, J., & Zhang, Z. (2016). The impact of occupational
community on the quality of internal control. Journal of Business Ethics,
139(2), 271-285.
Cao, S., Jiang, W., Yang, B., & Zhang, A. L. (2020). How to talk when a machine is
listening: Corporate disclosure in the age of AI (No. w27950). National
Bureau of Economic Research.
Cavicchini, A., Ferraro, F., & Samila, S. (2021). Under pressure: Culture and
structure as antecedents of organizational misconduct. Academy of
Management Proceedings (Vol. 2021, No. 1, p. 16077). Briarcliff Manor, NY
10510: Academy of Management.
Cecchini, M., Aytug, H., Koehler, G. J., & Pathak, P. (2010a). Detecting management
fraud in public companies. Management Science, 56(7), 1146-1160.

Cecchini, M., Aytug, H., Koehler, G. J., & Pathak, P. (2010b). Making words work:
Using financial text as a predictor of financial events. Decision Support
Systems, 50(1), 164-175.
Chen, H., Grant-Muller, S., Mussone, L., & Montgomery, F. (2001). A study of
hybrid neural network approaches and the effects of missing data on traffic
forecasting. Neural Computing & Applications, 10(3), 277-286.
Cheng, S., Felix, R., & Indjejikian, R. (2019). Spillover effects of internal
control weakness disclosures: The role of audit committees and board
connections. Contemporary Accounting Research, 36(2), 934-957.
Chu, Y., Kaushik, A. C., Wang, X., Wang, W., Zhang, Y., Shan, X., Salahub, D. R.,
Xiong, Y., & Wei, D.-Q. (2021). DTI-CDF: a cascade deep forest model
towards the prediction of drug-target interactions based on hybrid features.
Briefings in Bioinformatics, 22(1), 451-462.
Coates, J. C., & Srinivasan, S. (2014). SOX after ten years: A multidisciplinary
review. Accounting Horizons, 28(3), 627-671.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational
and Psychological Measurement, 20(1), 37-46.
Corritore, M., Goldberg, A., & Srivastava, S. B. (2020). The new analytics of culture.
Harvard Business Review, 98(1), 76-83.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273-297.
Craja, P., Kim, A., & Lessmann, S. (2020). Deep learning for detecting financial
statement fraud. Decision Support Systems, 139, 113421.
Dechow, P. M., Ge, W., Larson, C. R., & Sloan, R. G. (2011). Predicting material
accounting misstatements. Contemporary Accounting Research, 28(1), 17-82.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep
bidirectional transformers for language understanding. arXiv preprint
arXiv:1810.04805.
Donelson, D. C., Ege, M. S., & McInnis, J. M. (2017). Internal control weaknesses
and financial reporting fraud. Auditing: A Journal of Practice & Theory,
36(3), 45-69.
Donelson, D. C., Kartapanis, A., McInnis, J. M., & Yust, C. G. (2021). Measuring
accounting fraud and irregularities using public and private enforcement.

The Accounting Review, 96(6), 183-213.
Doyle, J., Ge, W., & McVay, S. (2007). Determinants of weaknesses in internal
control over financial reporting. Journal of Accounting and Economics, 44(1-
2), 193-223.
Dyck, A., Morse, A., & Zingales, L. (2010). Who blows the whistle on corporate
fraud? The Journal of Finance, 65(6), 2213-2253.
Earley, C. E. (2015). Data analytics in auditing: Opportunities and challenges.
Business Horizons, 58(5), 493-500.
Elkan, C. (2001). The foundations of cost-sensitive learning. International joint
conference on artificial intelligence, (Vol. 17, No. 1, pp. 973-978). Lawrence
Erlbaum Associates Ltd.
Fan, W., Gordon, M. D., & Pathak, P. (2005). Effective profiling of consumer
information retrieval needs: A unified framework and empirical comparison.
Decision Support Systems, 40(2), 213-233.
Fanning, K. M., & Cogger, K. O. (1998). Neural network detection of management
fraud using published financial data. Intelligent Systems in Accounting,
Finance & Management, 7(1), 21-41.
Feelders, A. (1999). Handling missing data in trees: surrogate splits or statistical
imputation? European Conference on Principles of Data Mining and
Knowledge Discovery, (pp. 329-334). Berlin, Heidelberg: Springer Berlin
Heidelberg.
Feroz, E. H., Kwon, T. M., Pastena, V. S., & Park, K. (2000). The efficacy of red
flags in predicting the SEC's targets: an artificial neural networks approach.
Intelligent Systems in Accounting, Finance & Management, 9(3), 145-157.
Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., & Herrera, F. (2011). A
review on ensembles for the class imbalance problem: bagging-, boosting-,
and hybrid-based approaches. IEEE Transactions on Systems, Man, and
Cybernetics, Part C (Applications and Reviews), 42(4), 463-484.
Goel, S., & Gangolly, J. (2012). Beyond the numbers: Mining the annual reports for
hidden cues indicative of financial statement fraud. Intelligent Systems in
Accounting, Finance and Management, 19(2), 75-89.

Goel, S., & Uzuner, O. (2016). Do sentiments matter in fraud detection? Estimating
semantic orientation of annual reports. Intelligent Systems in Accounting,
Finance and Management, 23(3), 215-239.
Goldberg, D. M., & Abrahams, A. S. (2018). A Tabu search heuristic for
smoke term curation in safety defect discovery. Decision Support Systems, 105, 52-65.
Goldberg, D. M., Zaman, N., Brahma, A., & Aloiso, M. (2022). Are mortgage loan
closing delay risks predictable? A predictive analysis using text mining on
discussion threads. Journal of the Association for Information Science and
Technology, 73(3), 419-437.
Green, T. C., Huang, R., Wen, Q., & Zhou, D. (2019). Crowdsourced employer
reviews and stock returns. Journal of Financial Economics, 134(1), 236-251.
Guiso, L., Sapienza, P., & Zingales, L. (2015). The value of corporate culture.
Journal of Financial Economics, 117(1), 60-76.
Hajek, P., & Henriques, R. (2017). Mining corporate annual reports for intelligent
detection of financial statement fraud–A comparative study of machine
learning methods. Knowledge-Based Systems, 128, 139-152.
Hales, J., Moon Jr, J. R., & Swenson, L. A. (2018). A new era of voluntary
disclosure? Empirical evidence on how employee postings on social media
relate to future corporate disclosures. Accounting, Organizations and Society,
68, 88-108.
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions
on knowledge and data engineering, 21(9), 1263-1284.
Hooghiemstra, R., Hermes, N., & Emanuels, J. (2015). National culture and internal
control disclosures: A cross‐country analysis. Corporate Governance: An
International Review, 23(4), 357-377.
Hoogs, B., Kiehl, T., Lacomb, C., & Senturk, D. (2007). A genetic algorithm
approach to detecting temporal patterns indicative of financial statement fraud.
Intelligent Systems in Accounting, Finance & Management: International
Journal, 15(1‐2), 41-56.
Howell, D. C. (2007). The treatment of missing data. The Sage handbook of social
science methodology, 208-224.

Huang, S., & Yang, X. (2010). Internal control report, quality of financial reports and
information asymmetry: Empirical evidence from listed companies in
Shanghai Stock Exchange. Journal of Finance and Economics, 36, 81-91.
Humpherys, S. L., Moffitt, K. C., Burns, M. B., Burgoon, J. K., & Felix, W. F.
(2011). Identification of fraudulent financial statements using linguistic
credibility analysis. Decision Support Systems, 50(3), 585-594.
Järvelin, K., & Kekäläinen, J. (2002). Cumulated gain-based evaluation of IR
techniques. ACM Transactions on Information Systems (TOIS), 20(4), 422-
446.
Jerez, J. M., Molina, I., García-Laencina, P. J., Alba, E., Ribelles, N., Martín, M., &
Franco, L. (2010). Missing data imputation using statistical and machine
learning methods in a real breast cancer problem. Artificial intelligence in
medicine, 50(2), 105-115.
Ji, Y., Rozenbaum, O., & Welch, K. (2017). Corporate culture and financial reporting
risk: Looking through the glassdoor. Available at SSRN 2945745.
Johnstone, K., Li, C., & Rupley, K. H. (2011). Changes in corporate governance
associated with the revelation of internal control material weaknesses and their
subsequent remediation. Contemporary Accounting Research, 28(1), 331-383.
Karpoff, J. M., Koester, A., Lee, D. S., & Martin, G. S. (2017). Proxies and databases
in financial misconduct research. The Accounting Review, 92(6), 129-163.
Kim, Y. J., Baik, B., & Cho, S. (2016). Detecting financial misstatements with fraud
intention using multi-class cost-sensitive learning. Expert Systems with
Applications, 62, 32-43.
Kotsiantis, S., Koumanakos, E., Tzelepis, D., & Tampakas, V. (2006). Forecasting
fraudulent financial statements using data mining. International Journal of
Computational Intelligence, 3(2), 104-110.
Koubli, E., Palmer, D., Rowley, P., & Gottschalg, R. (2016). Inference of missing
data in photovoltaic monitoring datasets. IET Renewable Power Generation,
10(4), 434-439.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for
categorical data. Biometrics, 33(1), 159-174.

Li, J. (2022). The effect of employee satisfaction on effective corporate tax planning:
Evidence from Glassdoor. Advances in accounting, 57, 100597.
Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R
News, 2(3), 18-22.
Lin, C. C., Chiu, A. A., Huang, S. Y., & Yen, D. C. (2015). Detecting the financial
statement fraud: The analysis of the differences between data mining
techniques and experts’ judgments. Knowledge-Based Systems, 89, 459-470.
Lin, Y. C., Wang, Y. C., Chiou, J. R., & Huang, H. W. (2014). CEO characteristics
and internal control quality. Corporate Governance: An International Review,
22(1), 24-42.
Liu, X. Y., & Zhou, Z. H. (2006). The influence of class imbalance on cost-sensitive
learning: An empirical study. Sixth International Conference on Data Mining
(ICDM'06).
Liu, Y., Zhang, S., Ding, B., Li, X., & Wang, Y. (2018). A cascade forest approach to
application classification of mobile traces. 2018 IEEE Wireless
Communications and Networking Conference (WCNC).
Ma, C., Liu, Z., Cao, Z., Song, W., Zhang, J., & Zeng, W. (2020). Cost-sensitive deep
forest for price prediction. Pattern recognition, 107, 107499.
Mali, N., Restrepo, F., Abrahams, A., & Ractham, P. (2022). Implementation
of MARS metrics and MARS charts for evaluating classifier exclusivity: The
comparative uniqueness of binary classifier predictions. Software Impacts, 12, 100259.
Mao, J., & Ettredge, M. (2016). Internal control deficiency disclosures among
Chinese reverse merger firms. Abacus, 52(3), 441-472.
Mao, M. Q., & Yu, Y. (2015). Analysts' cash flow forecasts, audit effort, and audit
opinions on internal control. Journal of Business Finance & Accounting, 42(5-
6), 635-664.
Mazza, T., & Azzali, S. (2015). Effects of internal audit quality on the severity and
persistence of controls deficiencies. International Journal of Auditing, 19(3),
148-165.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation
of word representations in vector space. arXiv preprint arXiv:1301.3781.

Moepya, S. O., Akhoury, S. S., & Nelwamondo, F. V. (2014). Applying cost-sensitive
classification for financial fraud detection under high class-imbalance. 2014
IEEE International Conference on Data Mining Workshop.
Nasir, M., Simsek, S., Cornelsen, E., Ragothaman, S., & Dag, A. (2021). Developing
a decision support system to detect material weaknesses in internal control.
Decision Support Systems, 113631.
Ngai, E. W., Hu, Y., Wong, Y. H., Chen, Y., & Sun, X. (2011). The application of
data mining techniques in financial fraud detection: A classification
framework and an academic review of literature. Decision Support Systems,
50(3), 559-569.
Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word
representation. Proceedings of the 2014 Conference on Empirical Methods in
Natural Language Processing (EMNLP), 1532-1543.
Perols, J. L. (2011). Financial statement fraud detection: An analysis of statistical and
machine learning algorithms. Auditing: A Journal of Practice & Theory,
30(2), 19-50.
Perols, J. L., Bowen, R. M., Zimmermann, C., & Samba, B. (2017). Finding needles
in a haystack: Using data analytics to improve fraud prediction. The
Accounting Review, 92(2), 221-245.
Plumlee, M., & Yohn, T. L. (2010). An analysis of the underlying causes
attributed to restatements. Accounting Horizons, 24(1), 41-64.
Purda, L., & Skillicorn, D. (2015). Accounting variables, deception, and a bag of
words: Assessing the tools of fraud detection. Contemporary Accounting
Research, 32(3), 1193-1223.
Purwar, A., & Singh, S. K. (2015). Hybrid prediction model with missing value
imputation for medical data. Expert Systems with Applications, 42(13), 5621-
5631.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019).
Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.
Ravisankar, P., Ravi, V., Rao, G. R., & Bose, I. (2011). Detection of financial
statement fraud and feature selection using data mining techniques. Decision
Support Systems, 50(2), 491-500.

Restrepo, F., Mali, N., Abrahams, A., & Ractham, P. (2022). Formal definition
of the MARS method for quantifying the unique target class discoveries of
selected machine classifiers. F1000Research, 11, 391.
Rice, S. C., & Weber, D. P. (2012). How effective is internal control
reporting under SOX 404? Determinants of the (non-)disclosure of existing
material weaknesses. Journal of Accounting Research, 50(3), 811-843.
Rich, K. T., Roberts, B. L., & Zhang, J. X. (2018). Linguistic tone and internal control
reporting: evidence from municipal management discussion and analysis
disclosures. Journal of Governmental & Nonprofit Accounting, 7(1), 24-54.
Robertson, S. E. (1986). On relevance weight estimation and query
expansion. Journal of Documentation, 42(3), 182-188.
Su, L. N., Zhao, X. R., & Zhou, G. S. (2014). Do customers respond to the disclosure
of internal control weakness? Journal of Business Research, 67(7), 1508-1518.
Sun, T. (2018). The incremental informativeness of the sentiment of conference calls
for internal control material weaknesses. Journal of Emerging Technologies in
Accounting, 15(1), 11-27.
Sun, T. (2019). Applying deep learning to audit procedures: An illustrative
framework. Accounting Horizons, 33(3), 89-109.
Teoh, S. H. (2018). The promise and challenges of new datasets for accounting
research. Accounting, Organizations and Society, 68, 109-117.
Throckmorton, C. S., Mayew, W. J., Venkatachalam, M., & Collins, L. M. (2015).
Financial fraud detection using vocal, linguistic and financial cues. Decision
Support Systems, 74, 78-87.
Twala, B. (2009). An empirical comparison of techniques for handling incomplete
data using decision trees. Applied Artificial Intelligence, 23(5), 373-405.
Twala, B., & Nelwamondo, F. (2017). Enhancing the detection of financial statement
fraud through the use of missing value estimation, multivariate filter feature
selection and cost-sensitive classification. University of Johannesburg (South
Africa).
Vasarhelyi, M. A., Kogan, A., & Tuttle, B. M. (2015). Big Data in accounting: An
overview. Accounting Horizons, 29(2), 381-396.

Walker, S. (2020). A Needle Found: Machine learning does not significantly improve
corporate fraud detection beyond a simple screen on sales growth. Available at
SSRN.
Warren Jr, J. D., Moffitt, K. C., & Byrnes, P. (2015). How Big Data will change
accounting. Accounting Horizons, 29(2), 397-407.
West, J., & Bhattacharya, M. (2016). Intelligent financial fraud detection: a
comprehensive review. Computers & security, 57, 47-66.
Winkler, M., Abrahams, A. S., Gruss, R., & Ehsani, J. P. (2016). Toy safety
surveillance from online reviews. Decision Support Systems, 90, 23-32.
Zadrozny, B., Langford, J., & Abe, N. (2003). Cost-sensitive learning by cost-
proportionate example weighting. Third IEEE International Conference on Data
Mining.
Zaman, N., Goldberg, D. M., Abrahams, A. S., & Essig, R. A. (2020). Facebook
Hospital Reviews: Automated Service Quality Detection and Relationships
with Patient Satisfaction. Decision Sciences, 52(6), 1403-1431.
Zaman, N., Goldberg, D. M., Gruss, R. J., Abrahams, A. S., Srisawas, S., Ractham,
P., & Şeref, M. M. (2021). Cross-category defect discovery from online
reviews: Supplementing sentiment with category-specific
semantics. Information Systems Frontiers, 1-21.
Zhang, Y.L., Zhou, J., Zheng, W., Feng, J., Li, L., Liu, Z., Li, M., Zhang, Z., Chen,
C., & Li, X. (2019). Distributed deep forest and its application to automatic
detection of cash-out fraud. ACM Transactions on Intelligent Systems and
Technology (TIST), 10(5), 1-19.
Zhou, Y., & Makridis, C. (2019). Firm Reputation Following Financial Misconduct:
Evidence from Employee Ratings. Available at SSRN 3271455.
Zhou, Z. H., & Feng, J. (2019). Deep forest. National Science Review, 6(1), 74-86.
Zhou, Z. H., & Liu, X. Y. (2005). Training cost-sensitive neural networks with
methods addressing the class imbalance problem. IEEE Transactions on
knowledge and data engineering, 18(1), 63-77.

APPENDICES
APPENDIX A
INDICATOR VARIABLES AND FREQUENCY AFFECTED BY
MISSTATEMENTS OVER THE TEST PERIOD 2003-2008

Rank | Indicator variable | Description | Frequency | Relative frequency
1 | inc_exp_se | Equals 1 if misstatement affected net income and, hence, shareholder equity, but could not be classified in any specific income, expense, or equity account below in this table; 0 otherwise | 169 | 31.8%
2 | rev | Equals 1 if misstatement affected revenues, 0 otherwise | 127 | 23.9%
3 | asset | Equals 1 if misstatement affected an asset account that could not be classified in a separate individual asset account in this table, 0 otherwise | 63 | 11.8%
4 | rec | Equals 1 if misstatement affected accounts receivable, 0 otherwise | 46 | 8.6%
5 | inv | Equals 1 if misstatement affected inventory, 0 otherwise | 29 | 5.5%
6 | res | Equals 1 if misstatement affected reserves accounts, 0 otherwise | 29 | 5.5%
7 | liab | Equals 1 if misstatement affected liabilities, 0 otherwise | 26 | 4.9%
8 | cogs | Equals 1 if misstatement affected cost of goods sold, 0 otherwise | 26 | 4.9%
9 | pay | Equals 1 if misstatement affected accounts payable, 0 otherwise | 17 | 3.2%
10 | mkt_sec | Equals 1 if misstatement affected marketable securities, 0 otherwise | 0 | 0.0%
11 | debt | Equals 1 if misstatement affected allowance for bad debts, 0 otherwise | 0 | 0.0%
Total | | | 532 | 100%
Source: Bao et al. (2020) Panel B, Table 7.


APPENDIX B
THE SAMPLING FIRMS IN THE DATA EXPLORATION PHASE

SIC | Firms with reported ICMW | Firms without reported ICMW
3577 | Zebra Technologies | Fortinet, Inc.
3714 | Tenneco Inc. | Borgwarner Inc.
7372 | Progress Software Corp | Marqeta, Inc.
5331 | Tuesday Morning Inc. | Ollie's Bargain Outlet Holdings, Inc.
8111 | Cra International, Inc. | LegalZoom
5080 | Aar Corp | SiteOne Landscape Supply, Inc.
2836 | BIO-TECHNE Corp | Seagen Inc.
5700 | Tile Shop Holdings, Inc. | At Home Group Inc.
8090 | Teladoc Health, Inc. | Option Care Health, Inc.
3576 | Digi International Inc. | A10 Networks, Inc.
3669 | OneSpan Inc. | Arlo Technologies, Inc.
3844 | Varex Imaging Corp | Osi Systems Inc.
2834 | Usana Health Sciences Inc. | Balchem Corp
4911 | Avangrid, Inc. | NRG Energy Inc.
7374 | Internap Corp | Verra Mobility Corp
3845 | iRhythm Technologies, Inc. | Nevro Corp
3350 | Kaiser Aluminum Corp | Century Aluminum Co
3949 | YETI Holdings, Inc. | Acushnet Holdings Corp
7370 | Veeva Systems Inc. | Nutanix Inc.


APPENDIX C
Coding protocol for mentions of
Principle 3: “Organizational structure, authority, and responsibility”

When assessing whether or not an employee review mentions a potential accounting
control issue relating to Principle 3, the following statement should be considered:
“The decisions and actions of different parts of the organization are
coordinated in a way that helps the department to meet its objectives.
Authority and responsibility are clearly defined and consistent with the
division/department’s objectives, so the right people can make decisions
and take actions.”
1) Mention: There is an indication that an employee believes the company is not
organized in an efficient way. This code also involves reviews indicating an
employee is not clear or confused about the organizational structure, reporting
lines, or roles and responsibilities. This code does NOT involve reviews
indicating an employee is not held accountable when he/she does not take on
his/her responsibilities.

Red flags: negative comments about the current organizational structure or
about changes to it (e.g., “change for the sake of change”); confusion about
roles and responsibilities in the new structure; insufficient support to
navigate the matrix structure; inadequate or inconsistent
authorization/empowerment; a perception of role ambiguity (e.g., no clear job
description); duplication of effort; no work plan (e.g., work at short notice, etc.), and so forth.

Table 1: “Mention” Sample Reviews

Code: Mentioned
Examples:
  Employee 1: “A little disorganized and disjointed across the overall organization”
  Employee 2: “Vague description for job opportunities”
  Employee 3: “Daily fire fighting, contradictory requests regarding what to focus on”

2) No mention: If the review does not contain content related to the above, the
label should be “No mention”.


APPENDIX D
Coding protocol for mentions of
Principle 4: “Commitment to Competence”

When assessing whether or not an employee review mentions a potential accounting
control issue relating to Principle 4, the following statement should be considered:
“Employees in the division/department have the necessary knowledge,
skills and capacity to do their jobs. Competent individuals are retained in
alignment with the objectives.”
(There are various factors that can affect the retention of competent
employees. For this specific task, please ONLY focus on "performance
appraisal" as the factor related to the mentions of Principle 4.)
1) Mention: There is an indication that employees do not receive clear feedback
from their performance appraisal so that they can know which areas to improve
and how to address the problems. This code also involves employees feeling
confused or disagreeing with the performance metrics set.

Table 1: “Mentioned” Sample Reviews

Code: Mentioned
Examples:
  Employee 13: “...Metrics are not accurate and are a source of stress when trying to do what's right, but what's right isn't accounted for in the metrics which are used for performance evaluations.”
  Employee 14: “...Employees expected to complete peer to peer evaluations which are degrading particularly since most employees lack leadership experience....”
  Employee 15: “...there is no performance evaluation here in XX...”

2) No mention: If the review does not contain content related to the above, the
label should be “No mention”.


APPENDIX E
Coding protocol for mentions of
Principle 11, 13-15: “Control over Technology, Information, &
Communication”

Principle 11: General control over technology

When assessing whether or not an employee review mentions a potential accounting
control issue relating to Principle 11, the following statement should be considered:
“The organization selects and develops proper general controls over
technology to ensure the reliance on the application controls.”
1) General control over technology: Information technology general controls
(ITGC) include controls over access and security, systems development and
modification, and operations (e.g., maintenance, backup, and disaster recovery).
This code involves information-technology-related (IT-related) deficiencies
regarding the above issues, which may result in non-reliance on ITGC in an IT
setup comprising applications, databases, and operating systems.

Red flags: outdated information systems; cyber-security breaches, etc.

Table 1: “General Control over Technology” Sample Reviews

Code: General control over technology
Examples:
  Employee 1: “Fragmented vision for software and firmware led to inefficiencies. Not enough emphasis on firmware/software design and an understanding of how this affects the back-end of schedules... Corporate level tools and support are antiquated”
  Employee 2: “Too many ancillary software systems that overlap.”
  Employee 3: “The software used in many departments is way out of date which causes many interruptions and calls to IT. The IT group is so understaffed and has poor leaders which contribute to lack of process and productivity”


Principle 13-15: Information and communication

When assessing whether or not an employee review reveals a potential control
deficiency relating to Information and Communication, the following statement
should be considered:
“Sufficient and relevant information is identified, captured and provided to
the right people in a timely manner, enabling them to take action to meet
objectives and mitigate risks. Open lines of communication, upstream,
downstream, cross-functionally, and externally, promote understanding and
acceptance of values and assist the division/department in meeting their
objectives.”
2) Information and communication: There is an indication that the information
flow generated or provided for decision making in the organization is of low
quality. This code also applies when communication, upstream, downstream,
cross-functionally, or externally, is ineffective.

Red flags: low information quality (e.g., information overload; conflicting
information; lack of timely relevant information; incomplete information;
inaccurate data, etc.); a perception of not being listened to; duplication of
effort, and so forth.

Table 2: “Information and Communication” Sample Reviews

Code: Information and communication
Examples:
  Employee 7: “Their main system has tons of info and can require quite a bit of digging to get what you need.”
  Employee 8: “There's miss communication between the teams and much competition instead of collaboration among team members. The word used is ‘I’ instead of ‘we’”
  Employee 9: “Middle management covers all difficulties on the ground. Top management always getting the fabricated rosy pictures”

3) No mention: If the review does not contain content related to the above, the
label should be “No mention”.


BIOGRAPHY

Name Lukui Huang


Educational Attainment 2008: Bachelor in Business Administration
Finance Major
School of Economics
Beijing Technology and Business University
2009: Master of Accountancy in International
Accounting & Financial Management [MAcc]
Adam Smith Business School
University of Glasgow
Scholarship 2016-2020: Foreign Student Scholarship,
Thammasat University

Publications

Huang, L., Abrahams, A., & Ractham, P. (2022). Enhanced financial fraud
detection using cost‐sensitive cascade forest with missing value
imputation. Intelligent Systems in Accounting, Finance and Management,
29(3), 133-155.
