
Basic Audit Data Analytics

with R

Theophanis C. Stratopoulos, PhD


School of Accounting and Finance
University of Waterloo
Audit Data Analytics Committee - CPA Canada

&

Gregory P. Shields, CPA, CA


Audit Data Analytics Committee - CPA Canada

May 15, 2018



Contents

Preface xii

Authors xiv

Acknowledgments xvi

I Basic Data Analytics 1


1 Data Understanding 3
1.1 Information Regarding Audited Company . . . . . . . . . . . 3
1.2 Client Provided Data . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Relational Database . . . . . . . . . . . . . . . . . . . 4
1.2.2 Practice Problems . . . . . . . . . . . . . . . . . . . . . 6
1.3 Prepare R Environment (RStudio) . . . . . . . . . . . . . . . 8
1.4 Load and Review Data . . . . . . . . . . . . . . . . . . . . . . 8
1.4.1 Creating Subsets . . . . . . . . . . . . . . . . . . . . . 10
1.4.2 Ordering Data . . . . . . . . . . . . . . . . . . . . . . . 12
1.4.3 Summary Statistics . . . . . . . . . . . . . . . . . . . . 13
1.4.4 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.5 Communication . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.6 Practice Problems . . . . . . . . . . . . . . . . . . . . . . . . . 18

2 Intro to SQL 19
2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Simple SQL Queries . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.1 SELECT ... FROM . . . . . . . . . . . . . . . . . . . . 20
2.2.2 WHERE . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.3 DISTINCT . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.4 Practice Problem: ORDER BY . . . . . . . . . . . . . 22
2.3 Creating Aggregate data . . . . . . . . . . . . . . . . . . . . . 23


2.3.1 GROUP BY . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.2 WHERE v. HAVING . . . . . . . . . . . . . . . . . . . 24
2.3.3 Practice problems . . . . . . . . . . . . . . . . . . . . . 24
2.3.4 Create Aggregate Sales Data by Store . . . . . . . . . . 25
Load and Review Sales Data . . . . . . . . . . . . . . . 25
GROUP BY Store . . . . . . . . . . . . . . . . . . . . 26
2.4 Create Sales Data by Month . . . . . . . . . . . . . . . . . . . 26
2.4.1 Format Date Variable . . . . . . . . . . . . . . . . . . . 26
GROUP BY Month . . . . . . . . . . . . . . . . . . . . 27
2.5 Create Monthly Sales Data per Store . . . . . . . . . . . . . . 28
2.6 Merging Tables . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.6.1 Practice Problems . . . . . . . . . . . . . . . . . . . . . 30
2.6.2 INNER JOIN . . . . . . . . . . . . . . . . . . . . . . . 31
2.6.3 LEFT JOIN . . . . . . . . . . . . . . . . . . . . . . . . 32
2.7 Create Categorical Variables . . . . . . . . . . . . . . . . . . . 33
2.8 Solutions to Selected Practice Problems . . . . . . . . . . . . . 35

3 Statistical Outliers & Patterns 37


3.1 Statistical Outliers . . . . . . . . . . . . . . . . . . . . . . . . 37
3.1.1 Detecting Outliers with IQR . . . . . . . . . . . . . . . 38
3.1.2 Boxplot . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2 Subset of Outliers . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.1 Practice Problems . . . . . . . . . . . . . . . . . . . . . 41
3.3 Analyzing Sales Data by Store . . . . . . . . . . . . . . . . . . 41
3.3.1 Store Data . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.2 Group By Store Size: Outliers . . . . . . . . . . . . . . 43
3.3.3 Stores with Largest Number of Units Sold . . . . . . . 43
3.3.4 Practice Problems . . . . . . . . . . . . . . . . . . . . . 45
3.4 Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4.1 Monthly Data . . . . . . . . . . . . . . . . . . . . . . . 45
3.4.2 Creating a Time Series Graph . . . . . . . . . . . . . . 47
Practice Problem: Time Series Graphs . . . . . . . . . 48
3.4.3 Monthly Sales Data per Store . . . . . . . . . . . . . . 49
3.4.4 Focus on Stores from Q2 . . . . . . . . . . . . . . . . . 50
Create Time Series Graph . . . . . . . . . . . . . . . . 51
3.5 Solutions to Selected Practice Problems . . . . . . . . . . . . . 53


II Basic ADA 55
4 Inventory & Cost of Sales 57
4.1 Information Regarding Audited Company . . . . . . . . . . . 57
4.2 ADA Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3 Data Understanding . . . . . . . . . . . . . . . . . . . . . . . 59
4.3.1 Obtaining an Understanding of Sales Prices . . . . . . 59
Import and Review Sales Data . . . . . . . . . . . . . . 59
Create Aggregate Sales Data . . . . . . . . . . . . . . . 60
Review Aggregated Sales Data . . . . . . . . . . . . . . 60
Summary Statistics for Aggregated Sales Data . . . . . 61
Outliers in Price Range . . . . . . . . . . . . . . . . . . 62
4.3.2 Obtaining an Understanding of Inventory Costs . . . . 62
Import and Review Inventory Data . . . . . . . . . . . 62
4.3.3 Create Aggregate Inventory Data . . . . . . . . . . . . 63
4.3.4 Review Summary Statistics - Aggregate Data . . . . . 66
4.4 Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.4.1 Create Comparison Data . . . . . . . . . . . . . . . . . 67
4.4.2 Review Comparison Data . . . . . . . . . . . . . . . . 68
4.4.3 Summary Statistics . . . . . . . . . . . . . . . . . . . . 69
4.4.4 Focus on Relatively Large Discrepancies . . . . . . . . 69
4.5 Evaluation and Communication . . . . . . . . . . . . . . . . . 70
4.6 Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.7 Practice Problems . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.7.1 Most Expensive Products . . . . . . . . . . . . . . . . 72

5 Payroll 79
5.1 Information Regarding Bibitor’s Payroll . . . . . . . . . . . . 79
5.2 ADA Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.3 Data Understanding . . . . . . . . . . . . . . . . . . . . . . . 81
5.3.1 Load and Review Payroll Data . . . . . . . . . . . . . 82
Practice Problems: Verify Data . . . . . . . . . . . . . 85
5.3.2 Load and Review Data from Employee Table . . . . . . 85
Headcount: HQ vs. stores . . . . . . . . . . . . . . . . 86
Practice Problem: List of job titles . . . . . . . . . . . 87
5.3.3 Load and Review Store Data . . . . . . . . . . . . . . . 87
5.3.4 Create Combined Data for ADA . . . . . . . . . . . . . 87
5.4 ADA Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.4.1 Salaried and Hourly Employees . . . . . . . . . . . . . 89
Practice Problem: Deductions . . . . . . . . . . . . . . 92
5.4.2 Bi-Weekly Wages . . . . . . . . . . . . . . . . . . . . . 92


5.4.3 Bi-Weekly Wages: Store Level . . . . . . . . . . . . . . 95


Practice Problems: Employee Count By Store . . . . . 97
Practice Problems: Total Wages By Store . . . . . . . 98
5.4.4 Visualize Payroll by Store Size (SqFt) . . . . . . . . . . 98
Practice Problems: Employee Count by Quartiles of
Store Size . . . . . . . . . . . . . . . . . . . . 100
Practice Problems: Wage Payments by Quartiles of
Store Size . . . . . . . . . . . . . . . . . . . . 100
5.5 Evaluation and Communication . . . . . . . . . . . . . . . . . 101
5.6 Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.7 Solutions to Selected Practice Problems . . . . . . . . . . . . . 102
5.7.1 Solution to Practice Problems - section 5.3.1 . . . . . . 102
5.7.2 Solution to Practice Problems - section 5.3.2 . . . . . . 104
5.7.3 Solution to Practice Problems - section 5.4.3 . . . . . . 105
5.7.4 Solution to Practice Problems - section 5.4.3 . . . . . . 106
5.7.5 Solutions to Practice Problems - section 5.4.4 . . . . . 107
5.7.6 Solutions to Practice Problems - section 5.4.4 . . . . . 108

III Appendix 109


A Introduction to R 111
A.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
A.2 R Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
A.3 RStudio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
A.4 R Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
A.5 R Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
A.5.1 Operations . . . . . . . . . . . . . . . . . . . . . . . . . 115
A.5.2 Defining Variables . . . . . . . . . . . . . . . . . . . . 117
A.5.3 Cleaning Console and Environment . . . . . . . . . . . 118
A.6 Working with R files . . . . . . . . . . . . . . . . . . . . . . . 118
A.7 Using Functions . . . . . . . . . . . . . . . . . . . . . . . . . . 120
A.7.1 Generating basic descriptive statistics . . . . . . . . . . 120

Alphabetical Index 123



Preface

Abstract
Accounting professors, students and many auditors need readily available,
non-proprietary training material on audit data analytics (ADA). This need
was identified at the panel discussion on data analytics at the AAA meeting
in San Diego (August 2017). Several attendees stated that they need ma-
terials (instructions, cases, data, and code) to first train themselves in the
basics of ADA and then use the same materials to teach the topic in their
classes. In addition, feedback from a survey contracted by the CPA Canada
Audit Data Analytics Committee (Hampton and Stratopoulos 2016)1 indi-
cated that, outside of the big four public accounting firms, there are relatively
limited opportunities for training existing auditors in the use of data analyt-
ics.
“Basic Audit Data Analytics with R” is intended to meet the need noted
above. The training uses the software R because it is open-source (free) and
it provides virtually endless possibilities to those who learn it. The cases,
including practice problems, use comprehensive large data sets for an entire
company accessed from the HUB of Analytics Education. Millions of data
points are updated regularly.
The primary learning objective is to provide trainees with capabilities
required to perform entry level ADA. This means that - given a set of well-
defined objectives and a reasonably clean data set - those who have suc-
cessfully taken the training should have an understanding of how basic data
analytics can be effectively applied in aspects of a financial statement audit.
These basics include: 1. Setting ADA objectives: Identify aspects of the audit
where the audit team can use data analytics tools to obtain audit evidence.
2. Data Understanding: Identify sources of data, collect and extract data,
become familiar with data structure, identify data quality issues. 3. Data
Preparation: Be able to clean and transform data to enable effective and
efficient analysis. 4. Modeling: Explain the model underlying the ADA in
plain English. 5. Evaluation: Leverage statistical and logical techniques to
evaluate how valuable a model is, what has been found, and what to do with
the results. 6. Communication and Documentation: Communicate and document
the results of the ADA and use new insights obtained to help answer
questions and solve problems.

1 Hampton, C., and T. C. Stratopoulos. 2016. Audit Data Analytics Use: An
Exploratory Analysis. https://ssrn.com/abstract=2877358

Intended Learning Outcomes


Basic Audit Data Analytics with R is targeted at accounting students
who intend to enter the auditing profession, and auditors who are not familiar
with data analytics and the use of software - such as R - to perform these
types of procedures.
The ideal reader understands audit concepts and/or has some experience
in practicing auditing. This means that in each chapter we are going to
provide just a very high level review of audit concepts and/or suggest reading
for those readers who want to review them. Our main focus will be on
leveraging R to perform ADA using the large data sets from the HUB of
Analytics Education project (http://www.hubae.org/). More specifically,
the objectives of this training are as follows:

1. Provide trainees with capabilities required to perform entry level ADA.


This means that - given a set of well-defined audit objectives and a rea-
sonably clean data set - those who have successfully taken the training
should be able to

• identify and extract appropriate data,


• clean and transform data to create new variables,
• perform basic descriptive analytics to understand relevant pat-
terns,
• perform basic predictive or classification analytics,
• interpret results in the context of the problem and communicate
these findings.

2. Provide trainees with a basis for pursuing additional training in the use
of more advanced ADA, such as predictive analytics involving the use
of machine learning.


Framework Used for this Training Material


Each chapter will follow the general structure of the Cross Industry Stan-
dard Process for Data Mining (CRISP-DM).2 This means that each chapter
will walk the readers through the six stages of CRISP-DM: organizational
understanding, data understanding, data preparation, modeling, evaluation,
and communication. More specifically, by the end of each chapter trainees
should be able to do/achieve the following objectives:

1. Understand the auditor’s objectives of performing ADAs. Identify as-
pects of the audit where the audit engagement team can use data min-
ing concepts and tools to obtain audit evidence, taking into account,
for example:

• The particular circumstances of the audit engagement including
the nature of the entity being audited and the environment in
which it operates.
• The specific audit objectives to be achieved in performing an ADA,
taking into account, for example, whether the ADA is intended to
be a risk assessment procedure, a test of controls, or a substantive
procedure (or a combination thereof) and the particular assertions
which the ADA is intended to address.

2. Data Understanding: Identify sources of data, collect and extract data,
get familiar with data structure, identify quality issues.
3. Data Preparation: Be able to clean and transform data to enable
effective and efficient analysis.
4. Modeling: Explain the model underlying the ADA in plain English.
5. Evaluation: Leverage mathematical (i.e., test statistics) and logical
techniques to evaluate how valuable a model is, what it has found, and
what you may want to do with the results.
6. Communication and Documentation: Communicate and document the
results of the ADA and use the new insight to answer questions and
solve problems.

2 The Wikipedia article provides a very good introduction to CRISP-DM:
https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining.
However, if you want to get a more in-depth understanding or hope to leverage
data mining to advance your career, you should read some of the original articles:
ftp://ftp.software.ibm.com/software/analytics/spss/support/Modeler/Documentation/14/UserManual/CRISP-DM.pdf and
ftp://ftp.software.ibm.com/software/analytics/spss/documentation/modeler/14.2/en/CRISP_DM.pdf.


Matters Related to GAAS & GAAP


This training material is based on the premise that the auditor is performing
an audit in accordance with generally accepted auditing standards (GAAS).
Examples of GAAS include:

• International Standards on Auditing (ISAs).


• In the USA, standards promulgated by the Public Company Accounting
Oversight Board (PCAOB) for audits of public companies and by the
American Institute of CPAs (AICPA) for audits of other entities.
• Canadian Auditing Standards (CASs).

Also, it is assumed that the financial statements being audited are meant
to be prepared in accordance with generally accepted accounting principles
(GAAP), such as the International Financial Reporting Standards (IFRS) or,
in the USA, standards promulgated by the Financial Accounting Standards
Board (FASB).
When appropriate, this training material makes brief references to as-
pects of GAAP and GAAS to help illustrate objectives and conclusions for
the example ADA. However, this is not a training course in auditing or ac-
counting. Auditors using ADA would be expected to understand GAAP
applicable to the particular engagement, and understand and comply with
all relevant requirements in the applicable GAAS.
In complying with GAAS, auditors performing ADA would consider, for
example, matters related to setting objectives for an ADA, communications
among members of the engagement team regarding the ADA as well as mat-
ters that would be communicated to management or those charged with
governance of the audited entity and audit documentation.

Setting Objectives for an ADA


An ADA may be used in performing one or more of the following:

1. Procedures for which the objective is to assess risks of material mis-
statement at the assertion level or at the financial statement level,
including obtaining an understanding of the audited entity’s business
(i.e., risk assessment procedures).
2. Procedures for which the objective is to detect material misstatements
at the assertion level (i.e. substantive procedures).
3. Procedures for which the objective is to evaluate the operating effec-
tiveness of controls in preventing, or detecting and correcting, material
misstatements at the assertion level (i.e., tests of controls).


In using an ADA for one or more of the overall purposes noted above, the
auditor will determine specific objectives which use of the ADA is intended
to achieve. Those specific objectives are used in determining, for example,
the nature and extent of the data to be used in performing the ADA, and
how that data will be analyzed.

Communications
Communications Among Engagement Team Members
Communication objectives: GAAS include requirements regarding com-
munications among engagement team members. These requirements help
ensure that the planning, performance and evaluation of an ADA are of high
quality. Two-way communications enable more senior/experienced engage-
ment team members to provide appropriate direction and supervision to more
junior members of the team on the performance of the ADA. Those perform-
ing the ADA will provide timely feedback on any problems encountered in
performing the ADA, and on the results obtained. Examples of matters to
be communicated include:

1. The objectives of the ADA, including specific risks (at the assertion
level) that the ADA is intended to address, in the context of whether
the ADA is a risk assessment procedure, a test of controls, a substantive
procedure, or a combination thereof.
2. The nature, sources, format, timing, extent, level of disaggregation and
reliability of the data to be used.
3. The tools to be used to obtain the data and perform the ADA.
4. The expected output from the ADA including graphics.
5. Types of matters that may be identified in planning, performing and
evaluating the results of the ADA, including how and when such mat-
ters are to be communicated.
(a) ADA used as risk assessment procedures
• Risk of material misstatement not previously identified.
• Risk of material misstatement that is higher or lower than
previously identified.
• Matters that are relevant to planning the nature, timing and
extent of other audit procedures.
(b) ADA used to test controls over financial reporting
• Deficiencies, significant deficiencies (material weaknesses) in
internal control.


• Other matters relating to internal control that may be relevant to management of the audited entity.
(c) ADA used as substantive procedures
• Misstatements of the financial statements (whether due to
error or fraud).

Communications with Management


Communication objectives: GAAS include requirements regarding mat-
ters that the auditor is to communicate with management of the audited
entity. Such communications ensure that there is a mutual understanding
of the auditor’s use of ADA and that management is made aware of mat-
ters arising from the performance of the ADA that require management’s
attention.
Examples of matters that appropriate members of the engagement team
would discuss with management regarding use of ADA, or perhaps commu-
nicate more formally, are set out below.

1. Planning the ADA


(a) The nature, timing and extent of planned use of ADA (except
when achieving the objectives of the ADA requires an element of
surprise).
(b) Matters related to being able to obtain the required data in a
format appropriate for audit purposes.
2. Results of performing the ADA
(a) Deficiencies (including significant deficiencies/material weaknesses)
in internal control and misstatements.
(b) Graphics and related information resulting from the ADA that
might be of interest to management from the viewpoint of im-
proving aspects of the entity’s operations.

Communications with Those Charged with Governance


Communication objectives: GAAS include requirements regarding mat-
ters that the auditor is to communicate with those charged with governance
of the entity being audited (for example, members of the audit committee).
These requirements help ensure that the auditor informs the audit committee
of matters relevant to their oversight of the entity.
Examples of matters to be communicated include:


1. Material misstatements identified from an ADA that management has
not corrected.
2. Other significant matters that were discussed with management, includ-
ing for example, significant deficiencies/material weaknesses in internal
control over financial reporting identified as a result of use of the ADA.
3. Other significant matters revealed by the ADA that, in the auditor’s
professional judgment, are likely to be relevant to the oversight of the
financial reporting process.

Documentation of an ADA
GAAS require that the documentation of an ADA should be sufficient to
enable an experienced auditor, having no previous connection with the audit,
to understand the nature, timing and extent of the ADA, significant matters
arising in performing the ADA, and the results of the ADA, including the
audit evidence obtained.
The documentation of the example ADA described in this training would
therefore include the key aspects of the use of R in performing each stage of
the ADA noted, including for example:

1. The audit objectives in planning and performing the ADA.


2. The R script used.
3. The steps taken to understand and prepare the data for audit use,
including the sources and key attributes of the data used.
4. The models used.
5. The outputs from use of R including graphics if any.
6. The evaluation and communication of the results of the ADA.
7. The individual who performed the ADA and the date such work was
completed.
8. The individual who reviewed the ADA performed and the date and
extent of such review.

Authors

Theophanis C. Stratopoulos is an associate professor in the School of
Accounting and Finance at the University of Waterloo. His teaching and
research focus is on how firms leverage IT enabled strategies to achieve and
sustain competitive advantage, emerging technologies, and data analytics.
Prof. Stratopoulos’s work has been published in top-tier accounting and sys-
tems journals (e.g. The Accounting Review, Communications of the ACM,
Journal of Management Information Systems, Journal of Strategic Informa-
tion Systems, Journal of Information Systems, and International Journal of
Accounting Information Systems).
Prof. Stratopoulos has taught a wide spectrum of students ranging from
first-year undergraduates to seasoned executives in executive MBA classes.
He has taught courses in statistics, information systems, database manage-
ment systems, e-commerce, and data mining. Theo has worked (consulted)
on data analytics projects for large US firms, and delivered workshops (training)
to professionals on the impact of emerging technologies and the use of data
analytics. Theo is a member of the CPA Canada - Audit Data Analytics committee.

Gregory P. Shields, CPA, CA In 2015, Greg retired from his position as
CPA Canada Director, Auditing and Assurance Standards. In that position,
he supported the activities of the Canadian Auditing and Assurance Stan-
dards Board and served as a technical advisor to the International Auditing
and Assurance Standards Board.
Greg’s recent projects include assisting the AICPA Audit Data Analytics
Guide Working Group in developing the first edition of that guide. He is also
providing assistance to the CPA Canada Audit Data Analytics Committee in
developing a number of publications to address issues related to use of audit
data analytics. In addition, Greg is a member of CPA Canada’s Assurance
Innovation Committee. This Committee is a think tank formed to identify
and act on emerging patterns and trends that may significantly affect the
audit and assurance profession.

Acknowledgments

The authors greatly appreciate the financial support for this education project
provided by the University of Waterloo Centre for Information Integrity and
Information Systems Assurance (UW-CISA). In addition, the project would
not be successful without the cooperation and support of professor Charles
Bame-Aldred of Northeastern University.

Part I

Basic Data Analytics

Chapter 1

Data Understanding

Learning Objectives
By the end of this chapter, trainees should have learned how information is
organized in files in a database and how these files are interconnected. The
trainees should also have learned how to:

1. Prepare R (RStudio) for performing the ADA.


2. Load/import large data sets that were provided by the client in .csv
format.
3. Review the structure of these data sets.
4. Use some basic R commands, including those related to generating sum-
mary statistics and graphics.

1.1 Information Regarding Audited Company


This training material focuses on the audit of the financial statements of
Bibitor, LLC, for its year ended June 30, 2016. Some key attributes of
Bibitor’s business and the environment in which it operates are as follows:

1. Bibitor is a retail liquor company. It does not sell to wholesalers nor
does it have individually significant customers to whom sales discounts
are offered.
2. All sales are cash or credit card. The company does not have trade
accounts receivable.
3. The company’s headquarters are in the State of Lincoln (a fictional
state used for illustrative purposes) and it has 79 stores located in var-
ious cities throughout the state. In common with other states, Lincoln
has extensive and rigorously enforced laws and regulations regarding
the sale of liquor. For example, Bibitor is not permitted to ship inven-
tory out of state.
4. The company has been selling spirits and wine products for over 50
years. The business has been successful to date. In the audit of the
preceding year, the auditor did not identify any material uncertainty
regarding Bibitor’s ability to continue as a going concern.
5. During the fiscal year under audit, the company had over 10,000 brands
available for sale. Most of the products are low to medium-priced and
intended to appeal to a wide range of consumers (all of whom are
required to be over the legal drinking age). Bibitor also stocks limited
quantities of high-priced spirits and wines. Some of these brands are
individually priced at thousands of dollars.

1.2 Client Provided Data


The client (Bibitor, LLC) has provided us with the following 11 .csv files
for fiscal year ending on June 30, 2016:

1. tSales.csv 7. tEndInv.csv
2. tPurchases.csv 8. tEmployees.csv
3. tStores.csv 9. tPayroll.csv
4. tVendors.csv 10. tExpenses.csv
5. tProducts.csv 11. tInvoices.csv
6. tBegInv.csv

The reason that the client has provided us with .csv files, rather than
Excel files, has to do with the size of these data sets. Unlike Excel files,
which are currently limited to about a million lines, .csv files do not have
an upper limit. This difference matters since some of the client’s files have
more than a million observations (e.g., the sales file has over 12 million lines).

1.2.1 Relational Database


The 11 files provided by the client form a relational database. The visual
representation of this database, which is known as the entity-relationship
diagram (ERD), is shown in Figure 1.2 (p. 7). You can think of this figure
as the blueprint or the map of the database. Being able to understand and
explain the way these files/tables are related is important as it will help you
see the whole picture rather than isolated files.
The easiest way to visualize their connection is by narrating, telling a
story of the firm’s day-to-day operations.

Figure 1.1: Bibitor ERD Segment

For example, the top segment of
the firm’s ERD (Figure 1.1) says that the firm has stores in different loca-
tions and of different size. More specifically, information about the Bibitor
stores, such as the store identification number (Store), address (City and
Location) as well the size of each store in square feet (SqFt) is listed in the
table tStores. The first line of the stores table is shown below:

Store City Location SqFt


1 HARDERSFIELD HARDERSFIELD #1 8300

On each business day these stores generate many sales. Information about
each individual sales transaction, such as the sales identification number
(salesID), the store that made the sale (Store), the identification number
of the product sold (Brand), the salesPrice, salesQuantity, salesDate, and
exciseTax is captured in the table tSales. The first line from this table is
shown below:

salesID Store Brand SalesQuant SalesPrice SalesDate ExciseTax


1 1 10021 1 12.99 2015-07-12 0.11

Notice that in Figure 1.2 there is a line that connects the field Store from
the table tStores to the field Store from the table tSales. The line has the
number one (1) on the tStores side and the infinity sign (∞) on the tSales
side. This means that one observation (one store) on the tStores side can be
matched to a large number (many) sales transactions in the table tSales.1
1 The field Store in the table tStores is known as the primary key. The same field, when it
repeats in the table tSales, is known as the foreign key.
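
To see this one-to-many relationship in the data itself, here is a minimal sketch,
assuming both tStores.csv and tSales.csv have already been imported with fread()
as shown in section 1.4:

> length(unique(tStores$Store))       # 79: each store number appears exactly once (primary key)
> nrow(tSales[tSales$Store == 1, ])   # many rows: Store 1 recurs once per sales transaction (foreign key)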


The third table (tProducts) has information about each product that
the firm sells. This includes the product identification number (Brand),
product Description, product Size and Classification,2 and the identification
of the vendor (VendorNo) who supplies this product. The first line from the
tProducts table is shown below.

Brand VendorNo Description Size Classification


1 58 Gekkeikan Black & Gold Sake 750mL 1

The line connecting the table tProducts to tSales starts and ends at the
product identification number (Brand). The line has 1 on the tProducts side
and ∞ on the tSales side. This means that Brand = 1, i.e., the product
number for Gekkeikan Black & Gold Sake appears only once in the table
tProducts, but the Brand = 1 appears many times, once for each sale of
the product, on the tSales table.
The last table in Figure 1.2 is the table tVendors. It captures the unique
identification number for each vendor (VendorNo), the vendor’s company
name (VendorName), and the type of vendor.3 The line connecting the table
tProducts to tVendors starts and ends at the vendor identification number
(VendorNo). The line has ∞ on the tProducts side and 1 on the tVendors
side. This means that VendorNo = 4425, i.e., the identification number
for MARTIGNETTI COMPANIES appears only once in the table tVendors
and many times - once for each product supplied by this vendor - on the
tProducts table.

VendorNo VendorName type


4425 MARTIGNETTI COMPANIES Alcohol Supplier

1.2.2 Practice Problems


Use Figure 1.2 (p. 7) to describe in your own words the following paths
related to Bibitor’s operations:

1. Stores to purchase to products to vendors.


2. Stores to purchases to invoices to vendors
3. Stores to ending inventory to products to vendors
4. Stores to employees to payroll
5. Expenses to vendors
2 Bibitor classifies liquor as 1, and wine as 2.
3 Bibitor distinguishes alcohol suppliers from other vendors.


Figure 1.2: Bibitor: Entity Relationship Diagram


1.3 Prepare R Environment (RStudio)


In the rest of this chapter, we will use R (RStudio) to import and review
Bibitor data. If this is your first time using R, please read and follow the
directions in Appendix A on how to set up R and RStudio. Please make sure
to familiarize yourself with the RStudio environment and basic R operations
before you continue with the rest of this chapter.

1. Create a directory to store your data and analysis.


2. Download from http://www.hubae.org/ and save in this directory the
following files:
(a) tStores.csv
(b) tSales.csv
(c) tVendors.csv
(d) tProducts.csv
3. Start R Studio, and follow directions in Appendix A (p. 118) on how
to create a new R file. Name the file BBTR_intro and save it in the
same directory.
4. Set the working directory to this folder as follows:
• From the R Studio menu select Session
• select Set Working Directory
• select To Source File Location.
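
If you prefer typing to menus, the working directory can also be set from the
console. A minimal sketch, where the path shown is a hypothetical example that
you would replace with the folder created in step 1:

> setwd("C:/ADA/Bibitor")   # hypothetical path - point it to your own directory
> getwd()                   # prints the current working directory so you can confirm it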

Note: The complete R Script used for the creation of this chapter is avail-
able from the following URL: https://goo.gl/H2PHRW.

1.4 Load and Review Data


In the following paragraphs we are going to load and review stores data.

Import Data To import store data we are going to use the library data.table.
If this is the first time you are using this package, follow directions in Appendix
A (p. 114) on how to install packages in R. Remember, we install the pack-
age once, but we have to load it using the function library() every time we
want to use it.
After the package has been loaded (library(data.table)), we use the
function fread() and specify the name of the file we want to import in
quotation marks, as follows:


> library(data.table)
> tStores <- fread("tStores.csv")

Names of Variables: We can review the names of variables in the data
set using the function names(). The function takes one argument, the name
of the dataset. This means that inside the parenthesis, we specify the name
of the data set as follows:
> names(tStores)

[1] "Store" "City" "Location" "SqFt"

The data set has four variables: Store is the unique identifier for each
store/observation (primary key), City, Location, and SqFt.

Data Structure: We can get detailed information about the entire data
set (e.g., number of observations, variables, and format for each variable)
with the function str(...). The function str refers to data structure and
takes one argument, the name of the data set.
> str(tStores)

Classes 'data.table' and 'data.frame': 79 obs. of 4 variables:


$ Store : int 1 10 11 12 13 14 15 16 17 18 ...
$ City : chr "HARDERSFIELD" "HORNSEY" "CARDEND" ...
$ Location: chr "HARDERSFIELD #1" "HORNSEY #10" ...
$ SqFt : int 8300 12000 6600 4600 3200 7000 12200 3600 ...

The output starts with a statement indicating the classification of the data
set as data.table and data.frame. You can think of a data.frame as the
equivalent of a spreadsheet in Excel. It allows you to store data in rows and
columns. The data.table is a data.frame that is designed to work with
relatively large data sets.4 The data set has 79 observations and 4 variables.
The variables Store and SqFt are integers (int), while the variables City
and Location are formatted as text (chr). Knowing the format of variables
matters when doing statistical analysis. For example, we can calculate the
average of numeric data, but we would create a frequency table (or pie
chart) when working with categorical data. Similarly, when performing data
mining analysis, certain methods require only numbers, while others
can work with a combination of numbers and text.
4 A more detailed discussion of the differences between the two classes and the advantages
of using data.table is beyond the scope of this introductory primer.
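
If we only want the format of each variable - for example, to decide which ones can
go into numeric summaries - we can ask for the class of every column at once; a
minimal sketch using the base R function sapply(), which is not otherwise used in
this chapter:

> sapply(tStores, class)
      Store        City    Location        SqFt
  "integer" "character" "character"   "integer"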


Review Sample of Observations from Dataset With the function
head(...), we can view the first six observations in the dataset. With the
function tail(...), we can view the last six observations in the data set.
The script below shows the head of the store data; as an exercise, you should
generate the tail of the data.

> head(tStores)

Store City Location SqFt


1: 1 HARDERSFIELD HARDERSFIELD #1 8300
2: 10 HORNSEY HORNSEY #10 12000
3: 11 CARDEND CARDEND #11 6600
4: 12 LEESIDE LEESIDE #12 4600
5: 13 TARMSWORTH TARMSWORTH #13 3200
6: 14 BROMWICH BROMWICH #14 7000

The visual inspection of data does not provide significant incremental infor-
mation above what has been provided by the str() function. As we will see
below (p. 12), the functions head(...) and tail(...) are useful when we
want to review data that has been ordered.
Both functions can take a second argument that specifies the number of
observations to be shown. With the script below, we limit the number of
observations to three.

> head(tStores,3)

Store City Location SqFt


1: 1 HARDERSFIELD HARDERSFIELD #1 8300
2: 10 HORNSEY HORNSEY #10 12000
3: 11 CARDEND CARDEND #11 6600

1.4.1 Creating Subsets


There are cases when we would like to limit the number of observations or
variables that we use for our analysis. We are going to see some examples of
how we create such subsets.

Example 1: Create Subset Specifying Observations We want to cre-
ate a new data set that contains the first three observations and all variables
from the data set. We do this by adding a square bracket after the name
of the data set. Inside the bracket we have the statement 1:3 followed by a
comma.


> tStores[1:3,]

Store City Location SqFt


1: 1 HARDERSFIELD HARDERSFIELD #1 8300
2: 10 HORNSEY HORNSEY #10 12000
3: 11 CARDEND CARDEND #11 6600

Example 2: Create Subset Specifying Variables We want to create
a data set that has all observations, but only the last three variables. We do
this by adding a square bracket after the name of the data set. Inside the
bracket we have the comma followed by the statement 2:4.

> tStores[, 2:4]

City Location SqFt


1: HARDERSFIELD HARDERSFIELD #1 8300
2: HORNSEY HORNSEY #10 12000
3: CARDEND CARDEND #11 6600
... Lines 4 to 76 have been removed ...
77: ALNERWICK ALNERWICK #8 5400
78: BLACKPOOL BLACKPOOL #9 7000
79: BALLYMENA BALLYMENA #79 12000
City Location SqFt

Example 3: Create Subset Specifying Observations and Variables


We can create a subset that shows the first three observations and last three
variables using the combination of the above notations. Inside the bracket
we specify the number of observations (1:3) and the number of variables (2:4)
separated by a comma.
> tStores[1:3, 2:4]

City Location SqFt


1: HARDERSFIELD HARDERSFIELD #1 8300
2: HORNSEY HORNSEY #10 12000
3: CARDEND CARDEND #11 6600

The main message from these examples is that when we work with data
sets we use the following format: data[rows, columns]. Using this format
we can generate the following combinations:
1. All rows and all columns by including just a comma inside the bracket:
data[,]


2. Specified rows and all columns, by including a constraint before the


comma and nothing after: data[rows,]
3. All rows and specified columns, by leaving the area before the comma
blank and specifying the constraint after the comma: data[,columns]
4. Specified rows and columns, by imposing constraints before and after
the comma: data[rows, columns]
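
The row constraint does not have to be a range of row numbers; it can also be a
logical condition. A minimal sketch that keeps all columns but only the stores
larger than 19,500 square feet:

> tStores[tStores$SqFt > 19500, ]   # the logical condition before the comma keeps only the matching rows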

Combine head() with subsetting Recall that the argument within the
function head is the name of the data set. This means that we can use any
subset that we have created as the argument within the function head. The
following example shows how we do this to achieve the same output as the
one shown in Example 3 above. Notice that the argument in the brackets
([,2:4]) is the data subset.

> head(tStores[,2:4],3)

City Location SqFt


1: HARDERSFIELD HARDERSFIELD #1 8300
2: HORNSEY HORNSEY #10 12000
3: CARDEND CARDEND #11 6600

1.4.2 Ordering Data


When reviewing data, we may want to view observations in ascending or
descending order of a specific variable. For example, we may want to review
the smallest and largest Bibitor stores. In R the ordering is done using the
function order(). The argument inside the parenthesis is the name of the
variable. The default ordering is ascending (smallest to largest) and this is
done by simply entering the name of the variable. To achieve descending
order (largest to smallest) we put the negative sign in front of the variable
name (dataset$variableName).
Since our data set has more than one variable, we need to specify the
variable. We can do this by using the name of the data set, followed by the
dollar sign, and the name of the variable.

Ordering Ascending Order With the following script we specify that
we want to see the first three observations of a subset (head(subset, 3)).
The subset is made of the last three variables of the store data set that
has been ordered in terms of store size from smallest to largest (tStores[
order(tStores$SqFt) , 2:4]).


> head(tStores[order(tStores$SqFt),2:4],3)
City Location SqFt
1: HORNSEY HORNSEY #3 1100
2: FURNESS FURNESS #18 2700
3: CESTERFIELD CESTERFIELD #64 2800
From this we can see that the three smallest Bibitor stores are 1100, 2700,
and 2800 square feet respectively.

Ordering Descending Order With the following script we specify that
we want to see the first three observations of a subset (head(subset, 3)).
The subset is made of the last three variables of the store data set that
has been ordered in terms of store size from largest to smallest (tStores[
order( -tStores$SqFt) , 2:4]).
> head(tStores[order(-tStores$SqFt),2:4],3)
City Location SqFt
1: GARIGILL GARIGILL #49 33000
2: EANVERNESS EANVERNESS #66 20000
3: PITMERDEN PITMERDEN #34 18400
From this we can see that the three largest Bibitor stores are 33000, 20000,
and 18400 square feet respectively. The largest store (#49) is thirty times
larger than the smallest store (#3).

1.4.3 Summary Statistics


While ordering of the data gave us some understanding of the distribution
of the data set, we can get more details by generating descriptive statistics
(min, max, median, first quartile, etc.). Again, since our data set has more
than one variable, when we want to generate our statistics, we need to specify
the variable by using the name of the data set, followed by the dollar sign, and
the name of the variable.

Minimum and Maximum


> min(tStores$SqFt)
[1] 1100
> max(tStores$SqFt)
[1] 33000
The results validate what we already know from simply ordering the data.


Median and Quartiles We know that if we organize the data in ascending
order, the median splits the data set 50:50. We can generate the median using
the function median() as follows:
> median(tStores$SqFt)
[1] 6400
This means that 50% of the stores have a size below 6,400 square feet and
50% above.
We can obtain more detail by using quartiles. The function for calcu-
lating quartiles is quantile(...), which takes two arguments. The first is
the variable that we want to analyze and the second is the percentage of
observations that should fall below the reported value.
> quantile(tStores$SqFt, .5)
50%
6400
The two most commonly used quartiles are the first (Q1) and third (Q3).
The former is calculated by specifying that 25% (.25) of observations are
below this number and the latter by specifying that 75% (.75) are below this
value.
> quantile(tStores$SqFt, .25)
25%
4000
Approximately, 25% of Bibitor stores are 4,000 square feet or less.
> quantile(tStores$SqFt, .75)
75%
10200
Approximately, 75% of Bibitor stores are 10,200 square feet or less.
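
We can also request several quartiles in a single call by passing a vector of
probabilities to quantile(); a minimal sketch:

> quantile(tStores$SqFt, c(.25, .50, .75))
  25%   50%   75%
 4000  6400 10200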

Average We can use the function mean() to calculate the average store
size as follows:
> mean(tStores$SqFt)
[1] 7893.671
The average Bibitor store is 7893.671 square feet in size.


Summary Statistics Instead of generating one statistic at a time, we can
use the function summary to see all descriptive statistics for the square footage
of Bibitor stores.

> summary(tStores$SqFt)

Min. 1st Qu. Median Mean 3rd Qu. Max.


1100 4000 6400 7894 10200 33000

1.4.4 Graphs
The main objective of looking at summary statistics is to get a better un-
derstanding of the shape of the distribution. Alternatively, we can create a
histogram. The ggplot2 package allows for the creation of some very ad-
vanced graphs.5 The following example shows how to leverage this package
for creating a histogram. The resulting histogram is shown in Figure 1.3.

> library(ggplot2)
> storeSizeHistogram <- qplot(tStores$SqFt, geom="histogram")
> storeSizeHistogram

Figure 1.3: Bibitor Store Size - version 1

The above script specifies the following:

1. We load the library (library(ggplot2)).


5 Follow directions in Appendix A (p. 114) on how to install packages in R. Remember,
we install the package once, but we have to load it using the function library() every
time we want to use it.


2. We choose a name for our graph (storeSizeHistogram).


3. We use the assignment operator (<-) to store the graph under that name.
4. We use the function qplot to build the graph.
5. The function has two arguments. The first is the name of the variable
(tStores$SqFt). The second is the type of graph that we want to create
(geom="histogram").
The graph confirms what we have seen in the summary statistics, i.e., the
majority of the stores (75%) have a size between 1,100 and around 10,000
square feet. In addition, we see that the largest store lies far to the right of
the distribution. This means that it is an outlier. We will discuss outliers
and how to identify them in Chapter 3.
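
By default, qplot chooses the binning of the histogram automatically. A minimal
sketch of how we could control the bin width ourselves (both the name
storeSizeHistogram2 and the bin width of 2,000 square feet are arbitrary choices,
not used elsewhere in this training):

> storeSizeHistogram2 <- qplot(tStores$SqFt, geom = "histogram", binwidth = 2000)
> storeSizeHistogram2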

Refining the Graph The ggplot2 package gives us the ability to re-
fine the graph incrementally. Changing the graph incrementally or in layers
means that we can add more to an existing graph. In the following examples,
we make some very basic changes. The default label for the x-axis in Figure
1.3 is just the name of the variable (tStores$SqFt). With the following script,
we are changing the label for the x-axis.
> storeSizeHistogram <- storeSizeHistogram + xlab("Store Size (Sq.Ft)")
> storeSizeHistogram
As you can see, we can do this by saying that our updated graph is equal
to the existing graph plus a new piece of information. The piece that has
been added is the instruction that specifies the x-label as xlab("Store Size
(Sq.Ft)"). Please note that we include the label in quotation marks. The
updated version of the histogram is shown in Figure 1.4.
With the following script, we repeat this one more time to update the
label for the y-axis. The resulting new histogram is shown in Figure 1.5.
> storeSizeHistogram <- storeSizeHistogram + ylab("Count (Frequency)")
> storeSizeHistogram
In the above example, we have changed the labels in two separate stages.
However, this is not necessary. We could have added both labels in a single
statement as follows:
> storeSizeHistogram <- storeSizeHistogram + xlab("Store Size (Sq.Ft)") + ylab("Count (Frequency)")
> storeSizeHistogram


Figure 1.4: Bibitor Store Size - version 2

The resulting new histogram is the same one shown in Figure 1.5.

Figure 1.5: Bibitor Store Size - version 3

1.5 Communication
The preface sets out types of matters that the auditor would consider in
evaluating, communicating and documenting ADA. The subsequent chapters
in this training material provide brief illustrations of how the auditor might
address these matters in the context of specific ADA examples.


1.6 Practice Problems


Continue working with the same R file (BBTR_intro.R) to answer the fol-
lowing questions:

1. Import the sales data (tSales.csv)


(a) What is the number of observations and variables in this data set?
(b) Are there any variables for which we cannot generate descriptive
statistics?
(c) Generate summary statistics for price and quantity sold. Prepare a
brief description of your results.
2. Import the vendor data (tVendors.csv)
(a) What is the number of observations and variables in this data set?
(b) Are there any variables for which we cannot generate descriptive
statistics?
3. Import the products data (tProducts.csv)
(a) What is the number of observations and variables in this data set?
(b) How many products are classified as wine and how many as liquor?
Hint: Use the function table(dataset$variableName).



Chapter 2

Intro to SQL

Learning Objectives
Leveraging SQL (Structured Query Language) to extract appropriate data
from a large relational database and transform these data is one of the most
basic steps of data analytics. Our objective in this chapter is to learn how to
run queries with SQL.1 More specifically, by the end of this chapter trainees
should have learned how to:

1. Use basic SQL commands to write simple queries to create and review
subsets.
2. Use SQL code to create aggregated data by grouping based on at-
tributes of one or more variables.
3. Combine (merge) two tables based on a common field and create subsets
or aggregate data.

2.1 Background
SQL is a standardized programming language commonly used to manage
relational databases. The auditor uses SQL commands (queries and other
operations written as statements) in performing an ADA. These commands
enable the auditor to obtain and analyze particular subsets of data relevant
to achieving the objectives of the ADA being performed.

1 This chapter provides a very high-level introduction to SQL. Those interested in
getting a more thorough training in SQL may want to consider one of the online
training courses, such as https://www.w3schools.com/sql/default.asp.


2.2 Simple SQL Queries


To illustrate simple commands in SQL, we first obtain the entire set of data
we will be using for this purpose. This data is contained in the tStores table
in Bibitor’s database. The commands in R used to obtain this data are
discussed in chapter 1.2

> library(data.table)
> tStores <- fread("tStores.csv")
> str(tStores)

Classes 'data.table' and 'data.frame': 79 obs. of 4 variables:


$ Store : int 1 10 11 12 13 14 15 16 17 18 ...
$ City : chr "HARDERSFIELD" "HORNSEY" "CARDEND" ...
$ Location: chr "HARDERSFIELD #1" "HORNSEY #10" ...
$ SqFt : int 8300 12000 6600 4600 3200 7000 12200 ...

The data set has 79 observations (i.e., one for each of Bibitor’s stores) of the
4 variables noted in the left column of the str() output shown above.
Once the data set has been loaded, we can start using SQL commands to
extract (view) subsets of the data obtained. The advantage of SQL is that we
can generate most of the queries that we want with a handful of commands.

2.2.1 SELECT ... FROM


The most basic SQL commands are SELECT and FROM. The first one
specifies the variables we want to view and the second one the table from
which to extract the variables. SQL is not case sensitive; however, in the
following examples we use upper case simply to emphasize the basic SQL
commands: SELECT, FROM, WHERE, etc.
To write an SQL query in R, we use the function sqldf("...") from the
package sqldf.3 The actual SQL commands are included within the double
quotation marks. For example, if we want to create a new data set that has
all variables and all observations from the table tStores, the SQL is written as
follows:4
2 The complete R script used for the creation of this chapter is available from the
following URL: https://goo.gl/oypdnG
3 If this is the first time you are using this package, follow directions in Appendix A
(p. 114) on how to install packages in R.
4 The basic SQL commands shown in this section achieve the same goal that we
have achieved with subsetting (see chapter 1, section 1.4.1). However, we need to start
with the basic SQL commands before we introduce the more advanced ones.


1. We use the command SELECT to specify the names of the variables,
separated by commas: SELECT Store, City, Location, SqFt
2. We use the command FROM to specify the table. Notice that there is no
comma after the last variable.
3. dt1_a, dt1_b, and dt1_c are the new datasets we are creating using
the SQL commands.

> library(sqldf)
> dt1_a <- sqldf("SELECT Store, City, Location, SqFt
FROM tStores")
> str(dt1_a)

'data.frame': 79 obs. of 4 variables:


$ Store : int 1 10 11 12 13 14 15 16 17 18 ...
$ City : chr "HARDERSFIELD" "HORNSEY" "CARDEND" ...
$ Location: chr "HARDERSFIELD #1" "HORNSEY #10" ...
$ SqFt : int 8300 12000 6600 4600 3200 7000 12200 3600 ...

table.* We can use the table.* or simply the * sign to view all variables
from the table tStores as follows:

> dt1_b <- sqldf("SELECT tStores.* FROM tStores")


> str(dt1_b)

'data.frame': 79 obs. of 4 variables:


$ Store : int 1 10 11 12 13 14 15 16 17 18 ...
$ City : chr "HARDERSFIELD" "HORNSEY" "CARDEND" ...
$ Location: chr "HARDERSFIELD #1" "HORNSEY #10" ...
$ SqFt : int 8300 12000 6600 4600 3200 7000 12200 3600 ...
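
The plain asterisk achieves the same result; a minimal sketch (dt1_b2 is a
hypothetical name, not used elsewhere in this training):

> dt1_b2 <- sqldf("SELECT * FROM tStores")
> str(dt1_b2)   # identical structure to dt1_b: 79 obs. of 4 variables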

If you want to focus on only certain variables (for example, the two variables
store number and square footage), the SQL commands are:

> dt1_c <- sqldf("SELECT Store, SqFt FROM tStores")


> str(dt1_c)

'data.frame': 79 obs. of 2 variables:


$ Store: int 1 10 11 12 13 14 15 16 17 18 ...
$ SqFt : int 8300 12000 6600 4600 3200 7000 12200 3600 ...


2.2.2 WHERE
If we want to limit our records to a subset that meets certain conditions we
can use the command WHERE. For example, Bibitor’s largest stores may have
operational characteristics relevant to the audit that are different from those
of its smaller stores. We may want to limit our records to stores larger than
19500 square feet. As we have seen on p. 13, this will return the two largest
stores.

> dt1_d <- sqldf("SELECT Store, SqFt
                  FROM tStores WHERE SqFt>19500")
> str(dt1_d)

'data.frame': 2 obs. of 2 variables:


$ Store: int 49 66
$ SqFt : int 33000 20000

2.2.3 DISTINCT
Bibitor has stores in various cities, and in some cases, more than one store
in a particular city. Each city may have its own demographic, economic
or other characteristics relevant to the audit. Additionally, aspects of how
Bibitor operates may vary for operations in cities in which it has more than
one store. It may therefore be useful, for example, to identify the cities in
which Bibitor operates. We can do this using the DISTINCT command as
follows:

> dt1_e <- sqldf("SELECT DISTINCT(City) FROM tStores")


> str(dt1_e)

'data.frame': 67 obs. of 1 variable:


$ City: chr "HARDERSFIELD" "HORNSEY" "CARDEND" "LEESIDE" ...

The results above show that Bibitor operates in 67 cities.

2.2.4 Practice Problem: ORDER BY


In SQL, we can use the ORDER BY to display our records in a specified order.
The default is ascending order. If we want to make it descending order
we can do this by adding the argument DESC as follows: ORDER BY SqFt
DESC. Visit the https://www.w3schools.com/sql/default.asp and review
the tutorial on how to order data.
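As a minimal sketch of the syntax (dt1_sorted is a name we introduce here only for illustration), a query of this form lists the stores from largest to smallest; dropping the DESC keyword returns the default ascending order:

> dt1_sorted <- sqldf("SELECT Store, City, SqFt
FROM tStores ORDER BY SqFt DESC")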


1. Create a query that lists the Bibitor stores in descending order of store
size.
2. Create a query that lists the Bibitor stores in ascending order of store
size.

2.3 Creating Aggregate data


Auditors often need to aggregate data to enable meaningful comparison with,
for example, data from previous years or among data populations from vari-
ous locations. We can write queries that generate aggregate results such as
average, standard deviation, or simple count of observations.
The aggregation of data results in creating one or more new variables. As
illustrated below, we use the "AS name" command to name our new variable.
If we want to use a name made of separate words, we need to enclose it within single quotation marks.
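For instance, a minimal sketch (the quoted alias 'store count' is chosen only for illustration) is:

> sqldf("SELECT count(Store) AS 'store count' FROM tStores")

Running this returns a single row with the value 79, the total number of stores.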

2.3.1 GROUP BY
In our next query, we are going to explore the ability to aggregate data by
specific group. For example, in an audit of a company such as Bibitor, it
may be useful to count how many products we buy from each supplier or
how many stores are in each city. We create an aggregate query that will
provide us with store count in each city by using the command GROUP BY
as follows:
> dt1_f <- sqldf("SELECT DISTINCT(City),
count(Store) AS storeCount
FROM tStores GROUP BY City")
> str(dt1_f)

'data.frame': 67 obs. of 2 variables:


$ City : chr "ABERDEEN" "AETHELNEY" "ALNERWICK" ...
$ storeCount: int 1 1 1 1 1 1 1 1 1 1 ...

The results show that Bibitor has stores in 67 cities. Running a query that
returns DISTINCT values for the variable storeCount, we can see (below)
that the store count ranges from 1 to 4.
> sqldf("SELECT DISTINCT dt1_f.storeCount FROM dt1_f")

storeCount
1 1
2 2
3 3
4 4

2.3.2 WHERE v. HAVING


We saw above that we can use the command WHERE to impose a condition
(constraint) to limit the number of individual records in a query. If we want
to impose a condition at the group level rather than at the individual record
level, we use the command HAVING.
In the following example, we create a new data set dt1_g that provides a list of all cities that have more than one store and the number of stores in each of these cities.

> dt1_g <- sqldf("


SELECT DISTINCT(City), count(Store) AS storeCount
FROM tStores GROUP BY City HAVING count(Store)>1")
> dt1_g

City storeCount
1 DONCASTER 2
2 EANVERNESS 3
3 GOULCREST 2
4 HARDERSFIELD 2
5 HORNSEY 4
6 LARNWICK 2
7 MOUNTMEND 4

2.3.3 Practice problems


1. Review the example shown in https://www.w3schools.com/sql/sql_
in.asp on how to impose multiple constraints at the observation level.
Try to apply the same approach to run a query based on dt1_g and
return observations from either HORNSEY or MOUNTMEND.5
2. We can combine WHERE and HAVING by imposing constraints at the in-
dividual (row) level data and then at the aggregate level data. For
example, we may want to impose the following constraints: First im-
pose the constraint that we will be using only stores of size greater
than 10000 square feet. Second, using this subset, we want to find the list of cities that have more than one store. Notice that the first constraint is a WHERE while the second is a HAVING.

5 The solution to this problem is on p. 35.
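A sketch of the general shape of such a query is shown below (one possible way to combine the two constraints; the result is not shown here). The row-level condition goes in the WHERE clause and the group-level condition goes in the HAVING clause:

> sqldf("SELECT City, count(Store) AS storeCount
FROM tStores
WHERE SqFt > 10000
GROUP BY City
HAVING count(Store) > 1")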

2.3.4 Create Aggregate Sales Data by Store


In the following example, we will create aggregate sales data (e.g., units sold,
average price per unit, revenues) for each store. We start by importing the
sales data.

Load and Review Sales Data


> tSales <- fread("tSales.csv")
> names(tSales)
[1] "V1" "Store" "Brand"
[4] "SalesQuantity" "SalesPrice" "SalesDate"
[7] "ExciseTax"
The data set has seven variables: V1 is the primary key (unique identifier
for each observation), Store is the store number, and Brand is the product
number. The remaining variables are self-explanatory.
We can use the names function to rename any one of the variables in the data set. For example, with the following command we specify that we want to rename the variable V1 to a name that more clearly describes the nature of the data. The left side of the command states that we want to rename the first variable. We specify this by including its position within square brackets, i.e., [1]. The right side includes, within quotation marks, the new variable name, i.e., salesID.
> names(tSales)[1] <- "salesID"
> str(tSales)
Classes 'data.table' and 'data.frame':
12697914 obs. of 7 variables:
$ salesID : chr "1" "2" "3" "4" ...
$ Store : int 1 1 1 1 1 1 1 1 1 1 ...
$ Brand : int 10021 10051 10058 10058 10058 ...
$ SalesQuantity: int 1 1 1 1 1 2 2 2 2 2 ...
$ SalesPrice : num 13 60 14 14 14 ...
$ SalesDate : chr "2015-07-12" "2015-07-28" ...
$ ExciseTax : num 0.11 0.11 0.11 0.11 0.11 0.22 ...
The data set has 12697914 observations and 7 variables.


GROUP BY Store
With the following aggregate query, we specify that we want to create store
level variables for total (sum) of units/bottles of wine or alcohol sold (sa-
lesQ_Store), average price per unit of products sold (avgPrice_Store), and
total (sum) revenues (revenue_Store).
> dt2_a <- sqldf("
SELECT Store, sum(SalesQuantity) AS salesQ_Store,
avg(SalesPrice) AS avgPrice_Store,
sum(SalesPrice*SalesQuantity) AS revenue_Store
FROM tSales GROUP BY Store")
> str(dt2_a)
'data.frame': 79 obs. of 4 variables:
$ Store : int 1 2 3 4 5 6 7 8 9 10 ...
$ salesQ_Store : int 576092 403023 33962 279623 ...
$ avgPrice_Store: num 15 16.1 17.3 13.9 12.9 ...
$ revenue_Store : num 6912704 6091406 436062 3201746 1322488 ...

2.4 Create Sales Data by Month


When working with dates in SQL, we need to make sure that R reads the data properly and that we use the appropriate date format, depending on what we want to achieve. In this example, we focus on the dates of sales transactions.

2.4.1 Format Date Variable


> str(tSales$SalesDate)
chr [1:12697914] "2015-07-12" "2015-07-28" "2015-07-02" ...
The structure reveals that R has read the date as text (chr). The results based on the first few records show that dates were formatted using a year-month-day format. We can use the function as.Date to convert the variable format from text to date. The function takes two arguments. The first one specifies the variable that we want to convert to a DATE format (tSales$SalesDate). The second one specifies the format of the existing variable ("%Y-%m-%d").6
6 We use %y for the year abbreviated in two digits, %Y for the year in four digits, %m for the month in two digits, %b for the month abbreviated, %B for the complete month name, and %d for the day. The separator can be a space, a slash (/), or a hyphen (-).


> tSales$SalesDate <- as.Date(tSales$SalesDate, "%Y-%m-%d")

Question: How can you verify that the above command has converted the variable SalesDate from text (chr) to date?
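One quick check (a minimal sketch using a function already introduced) is to look at the class of the variable again; after the conversion it should no longer be chr:

> class(tSales$SalesDate)

[1] "Date"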
The next step is to create a new variable month from the DATE formatted variable tSales$SalesDate. The approach, as you can see below, is layered like an onion and is best understood from the inside out. First, we use the function format(...) to format/extract the month portion ("%m") of the variable tSales$SalesDate as a date (as.Date). Second, we use the function as.numeric(...) to convert the extracted values to numbers.7

7 If the objective were simply to group by the month, the inner extraction of the month would suffice. The reason we add the extra layers/conversions is that we want to create a graph and use the variable month as the horizontal axis.

> tSales$month <-


as.numeric(format(as.Date(tSales$SalesDate), "%m"))
> str(tSales$month)

num [1:12697914] 7 7 7 7 7 7 7 7 7 7 ...

The output above shows that the new variable that we created to capture
the month is numeric (num). For the first handful of observations it takes the
value of 7 (July).

GROUP BY Month
With the aggregate query below, we specify that we want to create the fol-
lowing monthly level variables:
• salesQ_Month, the total (sum) of units sold across all Bibitor stores.
• salesQ_Month_avgStore, the store average of total (sum) units sold across all Bibitor stores. In essence, this is the same as the variable salesQ_Month divided by the number of stores.
• avgPrice_Month, the average price per unit.
• revenue_Month, the total sales (revenues) across all Bibitor stores.
• revenue_Month_avgStore, the store average of total sales (revenues) across all Bibitor stores. This is the same as revenue_Month divided by the number of stores.

> dt2_b <- sqldf("


SELECT month,
sum(SalesQuantity) AS salesQ_Month,
sum(SalesQuantity)/count(DISTINCT(Store))
AS salesQ_Month_avgStore,
avg(SalesPrice) AS avgPrice_Month,
sum(SalesPrice*SalesQuantity) AS revenue_Month,
sum(SalesPrice*SalesQuantity)/count(DISTINCT(Store))
AS revenue_Month_avgStore
FROM tSales GROUP BY month")
> str(dt2_b)

'data.frame': 12 obs. of 6 variables:


$ month : num 1 2 3 4 5 6 7 8 9 10 ...
$ salesQ_Month : int 2194959 2125292 2219626 ...
$ salesQ_Month_avgStore : int 27784 26902 28096 28980 ...
$ avgPrice_Month : num 15.5 15.5 15.3 15.5 15.8 ...
$ revenue_Month : num 29854028 28876607 28988412 ...
$ revenue_Month_avgStore: num 377899 365527 366942 388908 ...

2.5 Create Monthly Sales Data per Store


We can now build on the previous steps to create monthly sales data by store.

> dt2_c <- sqldf("


SELECT Store, month,
sum(SalesQuantity) AS store_salesQ_Month,
avg(SalesPrice) AS store_avgPrice_Month,
sum(SalesPrice*SalesQuantity) AS store_revenue_Month
FROM tSales GROUP BY Store, month")
> str(dt2_c)

'data.frame': 943 obs. of 5 variables:


$ Store : int 1 1 1 1 1 1 1 1 1 1 ...
$ month : num 1 2 3 4 5 6 7 8 9 10 ...
$ store_salesQ_Month : int 42052 40698 45879 41882 ...
$ store_avgPrice_Month: num 14.7 14.9 15 15.1 15.1 ...
$ store_revenue_Month : num 496657 478459 549331 488542 ...

> sqldf("SELECT dt2_c.* FROM dt2_c WHERE Store=10")

Store month store_salesQ_Month store_avgPrice_Month


1 10 1 46036 15.75836
2 10 2 45540 15.66211


3 10 3 47725 15.40777
4 10 4 45800 15.73450
5 10 5 51742 15.71040
6 10 6 50958 15.48979
7 10 7 52846 14.97397
8 10 8 48871 14.82037
9 10 9 45936 15.33874
10 10 10 55961 15.99462
11 10 11 58761 15.43290
12 10 12 73609 16.02871

store_revenue_Month
1 581298.0
2 573628.8
3 571602.2
4 561714.6
5 645376.5
6 630687.2
7 642664.0
8 565636.5
9 549645.3
10 691374.2
11 798268.3
12 962990.4

2.6 Merging Tables


The Bibitor database has 11 tables. In our previous examples we extracted
data from one table at a time. However, in an audit engagement it is very
likely that we may have to extract data from multiple tables. For example,
consider the following scenario: We want to assess the risks of material mis-
statement that would result from a failure by Bibitor to record inventories
(i.e., spirits and wine) at the lower of its weighted average cost and net real-
izable value, in accordance with Bibitor’s accounting policy and GAAP. To
perform this analysis we will need to merge data from two tables: sales and
inventory.
The two most common SQL commands used to join (merge) tables are
the INNER JOIN (shown in table 2.1) and LEFT JOIN (shown in table 2.2).
When we merge two tables (b and a), there must be at least one field (Brand)
that has common values in both tables.


With the INNER JOIN we create a new table (InJn) that has only the
records where the b.Brand = a.Brand. As we can see from table 2.1, the
table InJn has only two observations corresponding to Product 2 and Product
5 which appear in both tables.
  b                        a                        InJn
  Brand       Price        Brand       Price        Brand       b.Price   a.Price
                           Product 1   10
  Product 2   10           Product 2   12           Product 2   10        12
  Product 3   11
                           Product 4   20
  Product 5   25           Product 5   39           Product 5   25        39

Table 2.1: Example of an INNER JOIN

When we perform a LEFT JOIN (table 2.2), the position of the table mat-
ters. The table listed first is left, and the table listed second is right. A LEFT
JOIN will return all records from the left table and the records from the right
table where the b.Brand = a.Brand.
As we can see from table 2.2, the table LfJn has all three records from the left table (b), but only the two matching records (Product 2 and Product 5) from the right table (a). Notice that in the table LfJn the a.Price for Product 3 is missing (NA) because the product is not in table a.
  b                        a                        LfJn
  Brand       Price        Brand       Price        Brand       b.Price   a.Price
                           Product 1   10
  Product 2   10           Product 2   12           Product 2   10        12
  Product 3   11                                    Product 3   11        NA
                           Product 4   20
  Product 5   25           Product 5   39           Product 5   25        39

Table 2.2: Example of a LEFT JOIN

2.6.1 Practice Problems


In table 2.2 move the a to the left and b to the right.
1. How many observations are in the new LfJn table?
2. Which observations from a will appear in the LfJn table?
3. Which observations from b will appear in the LfJn table?
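One way to check your answers is to rebuild the two small tables from table 2.2 as data frames and run the swapped join. The toy data below are our own reconstruction of that example, and the alias names (aPrice, bPrice) are only illustrative:

> a <- data.frame(Brand=c("Product 1","Product 2","Product 4","Product 5"),
Price=c(10, 12, 20, 39))
> b <- data.frame(Brand=c("Product 2","Product 3","Product 5"),
Price=c(10, 11, 25))
> sqldf("SELECT a.Brand, a.Price AS aPrice, b.Price AS bPrice
FROM a LEFT JOIN b ON a.Brand=b.Brand")

With a on the left, the result keeps all of a's records, and bPrice is missing (NA) for the products that do not appear in b.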


2.6.2 INNER JOIN


The data set dt2_a has aggregate data for each one of the 79 Bibitor stores.
However, it does not have the size of each store. The size of each store is
in the table tStores. With an INNER JOIN, we can create a new table that
contains sales data and store size.

> nrow(dt2_a)

[1] 79

> nrow(tStores)

[1] 79

Both tables have the same number of observations (79 stores) and a common
field (Store). Therefore, we should expect that creating a new table (dt3_a)
as an INNER JOIN between the two tables (dt2_a and tStores) would generate
the same number of observations.
Creating an INNER JOIN means that we need to specify the two tables that
will be linked (dt2_a INNER JOIN tStores), and the common field based on
which the two tables will be matched (ON dt2_a.Store=tStores.Store).8

> dt3_a <- sqldf("SELECT dt2_a.*, City, SqFt


FROM dt2_a INNER JOIN tStores ON dt2_a.Store=tStores.Store")
> str(dt3_a)

'data.frame': 79 obs. of 6 variables:


$ Store : int 1 2 3 4 5 6 7 8 9 10 ...
$ salesQ_Store : int 576092 403023 33962 279623 124272 467557 ...
$ avgPrice_Store: num 15 16.1 17.3 13.9 12.9 ...
$ revenue_Store : num 6912704 6091406 436062 3201746 1322488 ...
$ City : chr "HARDERSFIELD" "ASHBORNE" "HORNSEY" ...
$ SqFt : int 8300 7900 1100 7600 2900 6500 7100 ...

The above output shows that the new table has 79 observations (stores), all
variables from dt2_a (SELECT dt2_a.*), as well as the variables City and
SqFt from the tStores table.
8 In R, if we want to specify that we want to use a specific variable x from a data set dt, we do this by using the $ as follows: dt$x. If we want to do the same in SQL, we use the period (.) to separate the data set from the variable (i.e., dt.x). To avoid error messages, we need to make sure that our variables have names that are compatible with SQL notation. If a table has a column/field whose name is not a single word, i.e., it contains a period in its name, we need to rename it.


2.6.3 LEFT JOIN


We leverage a LEFT JOIN whenever we need to preserve all observations from
the table listed to the left of a join, and only the matching records from the
table listed on the right. In the following example, we will join the tables
tStores and dt1_g.
The table tStores has 79 observations (stores) and four variables (Store, City, Location, and SqFt).
> nrow(tStores)

[1] 79

> names(tStores)

[1] "Store" "City" "Location" "SqFt"

The table dt1_g (see p. 24) has 7 observations (cities that have more
than one store), and two variables (City and storeCount).

> nrow(dt1_g)

[1] 7

> names(dt1_g)

[1] "City" "storeCount"

Start with the INNER JOIN Creating the inner join of these two tables
shows that the new data set has 19 observations. These are the 19 stores
which are in cities that have more than one store.9

> dt3_b1 <- sqldf("SELECT tStores.*, storeCount


FROM tStores INNER JOIN dt1_g ON tStores.City=dt1_g.City")
> str(dt3_b1)

'data.frame': 19 obs. of 5 variables:


$ Store : int 1 10 27 28 3 31 32 33 38 4 ...
$ City : chr "HARDERSFIELD" "HORNSEY" "MOUNTMEND" ...
$ Location : chr "HARDERSFIELD #1" "HORNSEY #10" ...
$ SqFt : int 8300 12000 9900 3300 1100 4500 9600 ...
$ storeCount: int 2 4 4 2 4 4 4 4 2 3 ...
9 If we take the sum of all stores in table dt1_g on p. 24 we will see that it is 19, matching the number of observations in the join.


In a LEFT JOIN the position of the table to the left or right matters. In
the following example, we want all records from the table tStores and only
the matching records from the table dt1_g. This means that the tStores will
have to be on the left and dt1_g on the right (tStores LEFT JOIN dt1_g).
The rest of the query is the same as the one for the INNER JOIN.
> dt3_b2 <- sqldf("SELECT tStores.*, storeCount
FROM tStores LEFT JOIN dt1_g ON tStores.City=dt1_g.City")
> str(dt3_b2)

'data.frame': 79 obs. of 5 variables:


$ Store : int 1 10 11 12 13 14 15 16 17 18 ...
$ City : chr "HARDERSFIELD" "HORNSEY" "CARDEND" ...
$ Location : chr "HARDERSFIELD #1" "HORNSEY #10" ...
$ SqFt : int 8300 12000 6600 4600 3200 7000 12200 ...
$ storeCount: int 2 4 NA NA NA NA NA NA NA NA ...

The new data set has the same number of observations as the table tStores.
Please notice that some of the entries for storeCount are missing (NA). These
are the entries corresponding to cities that have only one store. Remember
that dt1_g contains only the cities with more than one store.

2.7 Create Categorical Variables


In some situations we may want to create a new categorical variable. For
example, the number of stores in a city may be used as a proxy for market
size. We can classify cities with more than two stores as large markets, cities
with two stores as average, and cities with one store as small markets. In
SQL we can create such an if-else statement using the following CASE ...
END structure:10

CASE
WHEN ... THEN ...
WHEN ... THEN ...
ELSE ...
END
In the following example, we use this approach to add a new variable marketSize to the data set dt3_b2.
10 The advantage of the CASE ... END versus the typical ifelse statement is that with the CASE ... END we can specify as many alternatives as we want. We can do this by simply adding more WHEN ... THEN ... lines in our query.


> dt3_b3 <- sqldf("SELECT dt3_b2.*,


CASE
WHEN storeCount> 2 THEN 'large'
WHEN storeCount =2 THEN 'average'
ELSE 'small'
END AS marketSize
FROM dt3_b2")
> str(dt3_b3)

'data.frame': 79 obs. of 6 variables:


$ Store : int 1 10 11 12 13 14 15 16 17 18 ...
$ City : chr "HARDERSFIELD" "HORNSEY" "CARDEND" ...
$ Location : chr "HARDERSFIELD #1" "HORNSEY #10" ...
$ SqFt : int 8300 12000 6600 4600 3200 7000 12200 ...
$ storeCount: int 2 4 NA NA NA NA NA NA NA NA ...
$ marketSize: chr "average" "large" "small" "small" ...


2.8 Solutions to Selected Practice Problems


Solution to problem 1 (p. 24).

> dt1_h <- sqldf("SELECT City, storeCount


FROM dt1_g
WHERE City IN ('HORNSEY', 'MOUNTMEND')")
> dt1_h

City storeCount
1 HORNSEY 4
2 MOUNTMEND 4

Chapter 3

Statistical Outliers & Patterns

Learning Objectives
By the end of this chapter trainees should have learned how to:
1. Detect statistical outliers using the interquartile range (IQR) approach
and/or boxplot.
2. Create a subset that captures detected outliers for further analysis.
3. Analyze aggregate data for outliers within groups (i.e., create side-by-
side boxplots).
4. Create time-series graphs and recognize seasonal patterns.

3.1 Statistical Outliers


This chapter discusses statistical outliers. Such outliers can be a useful start-
ing point for the auditor in understanding client data. For example, they may
indicate areas on which the auditor may wish to focus certain procedures (for
example, in Bibitor’s case, stores of a particular size or in a particular lo-
cation). It is important to note, however, that a statistical outlier is not
always indicative of a matter that requires further auditor attention. Many
populations of client data contain outliers that accurately reflect aspects of
their operations. However, some of these outliers may indicate, for example,
a higher risk of material misstatement than expected by the auditor, or an
actual misstatement. Such outliers would, of course, warrant further auditor
attention.
In Chapter 1 (p. 12) we learned how to order data in ascending or descending order. More specifically, we ordered Bibitor stores in terms of size (square feet) and found out that the largest store is almost thirty times bigger than the smallest one. In statistics, observations which are very large or
very small are called outliers. In this section, we will learn how to leverage
R in order to detect outliers in a data set using the interquartile range and
visualizing them with the boxplot.

Interquartile range (IQR) is the difference between the third quartile (Q3) and the first quartile (Q1). That is, the IQR measures the range of data values in the middle of the population. Based on table 3.1, which summarizes the statistics generated in Chapter 1 (p. 13), we have that the IQR = Q3 − Q1 = 10200 − 4000 = 6200.

min Q1 median Q3 max


1100 4000 6400 10200 33000

Table 3.1: Summary Statistics - Bibitor Stores

3.1.1 Detecting Outliers with IQR


To detect outliers with the IQR approach, we create a fence which is defined
on the lower side by the value Q1 − 1.5 ∗ IQR and on the upper side by the
value Q3 + 1.5 ∗ IQR. Values which are above the upper fence or below the
lower fence are defined as outliers. The upper fence is known as the upper
whisker (uw) and the lower fence as the lower whisker (lw). Based on data
from table 3.1, we find that the lower whisker is 4000 − 1.5 ∗ 6200 = −5300
and the upper whisker is 10200 + 1.5 ∗ 6200 = 19500. The size of the smallest
store (1100) is not below the lower whisker (-5300), therefore there are no
outliers on the left side of the distribution. However, the largest store (33000)
is above the upper whisker (19500) and is therefore considered an outlier under
this IQR approach.
With the following script we create the upper and lower whiskers for store sizes.2

2 The complete R Script used for the creation of this chapter is available from the following URL: https://goo.gl/QfEXDa

> library(data.table)
> tStores <- fread("tStores.csv")
> names(tStores)

[1] "Store" "City" "Location" "SqFt"

> summary(tStores$SqFt)

Min. 1st Qu. Median Mean 3rd Qu. Max.


1100 4000 6400 7894 10200 33000
> IQR(tStores$SqFt)
[1] 6200
Given that the lower whisker is Q1−1.5∗IQR we calculate this as follows:
> lw <- quantile(tStores$SqFt,.25)-1.5*IQR(tStores$SqFt)
> lw
25%
-5300
Given that the upper whisker is Q3 + 1.5 ∗ IQR we calculate this as
follows:
> uw <- quantile(tStores$SqFt,.75)+1.5*IQR(tStores$SqFt)
> uw
75%
19500

3.1.2 Boxplot
A boxplot is a way of graphically showing data in their quartiles, including the
"whiskers" as noted above (i.e., indicators of the variability of data beyond
the upper quartile and below the first quartile). Relevant commands for
creating a boxplot leveraging the package ggplot2 are set out below.
First, the function ggplot specifies the data set to be used (tStores) and
the aesthetic mappings (aes) used to describe how variables in the data are
mapped to visual properties in the graphic. Second, we specify the type of
graph to be made (i.e., geom_boxplot).
> library(ggplot2)
> storeSizeBPlot <- ggplot(tStores, aes(x=Store, y=SqFt)) +
geom_boxplot()
We can refine our graph by using the argument outlier.color within the function geom_boxplot to specify the color of the observations in the graph that are outliers. Figure 3.1 shows the resulting boxplot for Bibitor store size.
> storeSizeBPlot <- storeSizeBPlot +
geom_boxplot(outlier.color = "red")
> storeSizeBPlot


Figure 3.1: Boxplot: Bibitor Store Size

3.2 Subset of Outliers


Figure 3.1 shows that there are two outliers. This means that there are two
stores whose size is much bigger than the rest of the stores. We can identify
these two stores by creating a subset that returns only the observations that
meet a certain condition (i.e., store size higher than the uw). In the following example, we use an ifelse function to add a new variable named sqftOutlier to the data set. The variable takes two values: 1 if the store size is below the lw OR above the uw; otherwise, its value is 0.
More specifically, the function ifelse takes three arguments: First, the
condition to be assessed (tStores$SqFt<lw | tStores$SqFt>uw).3 Second,
the value (1) when the condition is true. Third, the value (0) when the
condition is false.

> tStores$sqftOutlier <-


ifelse(tStores$SqFt<lw|tStores$SqFt>uw,1,0)
> tStores[tStores$sqftOutlier==1,]

Store City Location SqFt sqftOutlier


1: 49 GARIGILL GARIGILL #49 33000 1
2: 66 EANVERNESS EANVERNESS #66 20000 1

We use the new variable to create and view the subset that returns only the
outliers (tStores[tStores$sqftOutlier == 1 , ]) above. Alternatively,
we can simply use the condition from the function ifelse to generate the
subset.
3 In R, the vertical line (|) indicates OR and the ampersand (&) indicates AND.


> tStores[tStores$SqFt<lw|tStores$SqFt>uw,]

Store City Location SqFt sqftOutlier


1: 49 GARIGILL GARIGILL #49 33000 1
2: 66 EANVERNESS EANVERNESS #66 20000 1

The first method (creating a new variable) is useful if there are a lot of
outliers/exceptions and we need to perform further statistical analysis to
understand patterns or common themes across the entire data set of outliers.
The second is more useful when dealing with just a handful of observations,
and a simple visual review would be enough to see what is going on.

3.2.1 Practice Problems


Use the IQR and/or Boxplot to answer the following questions using data
from the table tSales:
1. Are there outliers in prices of products sold?
2. How many?
3. If there are more than ten, show the top ten.
4. Are there outliers in the units sold (quantity)?
5. How many?
6. If there are more than ten, show the top ten.
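A minimal sketch of how the first two questions might be approached is shown below (lwPrice, uwPrice, and priceOutliers are names we introduce only for illustration); the same pattern can be repeated for SalesQuantity:

> lwPrice <- quantile(tSales$SalesPrice, .25) - 1.5*IQR(tSales$SalesPrice)
> uwPrice <- quantile(tSales$SalesPrice, .75) + 1.5*IQR(tSales$SalesPrice)
> priceOutliers <- tSales[tSales$SalesPrice<lwPrice | tSales$SalesPrice>uwPrice, ]
> nrow(priceOutliers)
> head(priceOutliers[order(-priceOutliers$SalesPrice), ], 10)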

3.3 Analyzing Sales Data by Store


For a multi-location audit such as that for Bibitor, it can be very useful
to develop statistics showing the quartiles of various data by store. For
example, a comparison of this data with statistics generated in previous years
may provide the auditor with information indicating significant changes from
previous years that warrant further auditor attention. Commands used to
generate summary statistics on sales relevant to the audit of Bibitor stores
are shown below.

3.3.1 Store Data


> tSales <- fread("tSales.csv")
> names(tSales)

[1] "V1" "Store" "Brand"


[4] "SalesQuantity" "SalesPrice" "SalesDate"
[7] "ExciseTax"


> names(tSales)[1] <- "salesID"


> str(tSales)

> library(sqldf)
> dt1_a <- sqldf("
SELECT Store, sum(SalesQuantity) AS salesQ_Store,
avg(SalesPrice) AS avgPrice_Store,
sum(SalesPrice*SalesQuantity) AS revenue_Store
FROM tSales GROUP BY Store")
> names(dt1_a)

[1] "Store" "salesQ_Store" "avgPrice_Store"


[4] "revenue_Store"

> summary(dt1_a[,2:4])

salesQ_Store avgPrice_Store revenue_Store


Min. : 33962 Min. :12.80 Min. : 436062
1st Qu.: 193070 1st Qu.:14.08 1st Qu.: 2399798
Median : 324654 Median :14.67 Median : 3961997
Mean : 408908 Mean :14.91 Mean : 5583184
3rd Qu.: 469678 3rd Qu.:15.46 3rd Qu.: 6210331
Max. :1623158 Max. :18.08 Max. :26064575

Using Mental Math Looking at these summary statistics and without


having to use R to do the IQR analysis, we can quickly establish the existence
of major outliers as follows:

1. Round Q1 and Q3 of the target variable (e.g., salesQ_Store) generously. Q1 is approximately 200K and Q3 is approximately 500K.
2. The IQR is approximately 300K (500 - 200) and 1.5*IQR is approximately 450K.
3. The lower whisker (200 - 450 = -250K) is negative. Since our min (33962) is positive, there are no outliers at the low end of the variable salesQ_Store.
4. The upper whisker (500 + 450 = 950K) is way below the max (1623158), therefore there is at least one large outlier (the max itself) at the high end of the variable salesQ_Store.
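If we want to confirm the mental math in R, the upper whisker can be computed directly from dt1_a (a minimal sketch using the same quantile and IQR functions as in section 3.1.1; uwQ is a name we introduce only for illustration):

> uwQ <- quantile(dt1_a$salesQ_Store, .75) + 1.5*IQR(dt1_a$salesQ_Store)
> max(dt1_a$salesQ_Store) > uwQ

Running the comparison returns TRUE, confirming that at least the largest store-level volume lies above the upper whisker.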


3.3.2 Group By Store Size: Outliers


Showing outliers on a more disaggregated basis (for example, by quartile) may provide the auditor with a more useful picture of key aspects of a client's operations. For example, it may be useful to look for outliers within each quartile of Bibitor store size, while keeping an eye on the two very large stores.
In the following example, we use the method CASE ... END (see p. 33) to
create a new categorical variable for store size (storeSize). The new variable
(storeSize) takes the value sqftOutlier if the store size is an outlier, Q4 if the
store size is in the 4th quartile, Q3 if in third, Q2 if in second, and Q1 if in
the first quartile. The commands used to obtain this information are set out
below.
> dt1_b <- sqldf("SELECT dt1_a.*, SqFt,
CASE
WHEN SqFt> 19500 THEN 'sqftOutlier'
WHEN SqFt> 10200 THEN 'Q4'
WHEN SqFt> 6400 THEN 'Q3'
WHEN SqFt> 4000 THEN 'Q2'
ELSE 'Q1'
END AS storeSize
FROM dt1_a INNER JOIN tStores
ON dt1_a.Store=tStores.Store")
> names(dt1_b)
[1] "Store" "salesQ_Store" "avgPrice_Store"
[4] "revenue_Store" "SqFt" "storeSize"
We can create a graph that will display a separate boxplot of each one of the
five groups of the variable storeSize, by specifying that the x-variable in the
aes is this categorical variable. The resulting graph is shown in Figure 3.2.
> ggplot(dt1_b,aes(storeSize, salesQ_Store)) +
geom_boxplot(outlier.color = "red")
From the graph we can see that there are eight outliers (red dots). Inter-
estingly, three out of these eight outliers are associated with relatively small
stores (i.e., stores from the second quartile).

3.3.3 Stores with Largest Number of Units Sold


In chapter 1 (p. 12) we learned how to organize and display data in ascending or descending order. In the following example, we apply this method to observe the ten largest stores in terms of units sold (salesQ_Store).


Figure 3.2: Boxplot: Bibitor Store Size Grouped

Interestingly, the largest volume of sales (units sold) was not generated by one of the largest stores in terms of size. The largest volume (1623158 bottles) came from a store in the second quartile (Q2) in terms of store size. Consistent with what we have seen in Figure 3.2, there are four observations (one red dot from Q2 and three from Q4) with volumes higher than that of either of the two largest stores. More specifically, the volumes of store #76 from Q2 and of stores #73, #34, and #38 from Q4 are higher than the volume of store #66, which is the second largest store. The sales volume of the largest store (#49) is not in the top ten list.

> head(dt1_b[order(-dt1_b$salesQ_Store),],10)

Store salesQ_Store avgPrice_Store revenue_Store SqFt


76 76 1623158 18.08060 26064575 5400
73 73 1415916 17.77670 22554592 15000
34 34 1369606 17.27582 20730030 18400
38 38 1337209 17.36106 19505303 14000
66 66 1140215 18.03583 18153040 20000
67 67 977455 17.78134 15487755 9200
69 69 976492 16.87477 14516917 4800
50 50 899439 16.85767 13958967 5800
60 60 826652 16.45769 12019129 9000
15 15 731619 15.59824 10324675 12200

storeSize
76 Q2


73 Q4
34 Q4
38 Q4
66 sqftOutlier
67 Q3
69 Q2
50 Q2
60 Q3
15 Q4

3.3.4 Practice Problems


Use the data from dt1_b to perform the following analysis (create boxplots):4
1. Look for outliers in average prices (avgPrice_Store) for each group of
stores.
2. Look for outliers in store revenues (revenue_Store) for each group of
stores.
3. Create a subset that lists the outliers (if any) from stores in Q2.

3.4 Time Series


Time series data (e.g., showing data in a series of years, months, weeks etc.)
can be very useful to the auditor. For example, for the audit of Bibitor, the
auditor would expect sales to be higher in months in which there are statu-
tory holidays. Also some sales might be seasonal (for example, beer sales
might, on average, be higher in the hot days of summer compared to win-
ter). When sales in a period are significantly lower or higher than what the
auditor expects, the auditor would perform further procedures to obtain an
explanation. In some cases, management will have a reasonable explanation
for which the auditor can obtain corroborating evidence. In other cases, the
variation from expectations might be the result of a material misstatement.
The commands set out below build on those previously presented to enable
the auditor to develop a useful time series analysis, including graphics to
clearly show variations in variables over time.

3.4.1 Monthly Data


Performing an analysis similar to the one shown in section 2.4, we generate
monthly data for Bibitor sales.
4 The solutions to the first and second problems are on p. 53.


> str(tSales$SalesDate)

chr [1:12697914] "2015-07-12" "2015-07-28" "2015-07-02" ...

> tSales$SalesDate <- as.Date(tSales$SalesDate, "%Y-%m-%d")


> tSales$month <-
as.numeric(format(as.Date(tSales$SalesDate), "%m"))
> str(tSales$month)

num [1:12697914] 7 7 7 7 7 7 7 7 7 7 ...

> dt2_a <- sqldf("


SELECT month,
sum(SalesQuantity) AS salesQ_Month,
sum(SalesQuantity)/count(DISTINCT(Store))
AS salesQ_Month_avgStore,
avg(SalesPrice) AS avgPrice_Month,
sum(SalesPrice*SalesQuantity) AS revenue_Month,
sum(SalesPrice*SalesQuantity)/count(DISTINCT(Store))
AS revenue_Month_avgStore
FROM tSales GROUP BY month")
> names(dt2_a)

[1] "month" "salesQ_Month"


[3] "salesQ_Month_avgStore" "avgPrice_Month"
[5] "revenue_Month" "revenue_Month_avgStore"

Using the function head() we review the aggregate sales data for the first
six months.

> head(dt2_a)

month salesQ_Month salesQ_Month_avgStore avgPrice_Month


1 1 2194959 27784 15.48994
2 2 2125292 26902 15.53356
3 3 2219626 28096 15.33772
4 4 2289425 28980 15.52159
5 5 2624496 33221 15.81803
6 6 2858944 36189 15.72103

revenue_Month revenue_Month_avgStore
1 29854028 377899.1


2 28876607 365526.7
3 28988412 366941.9
4 30723735 388908.0
5 36041211 456217.9
6 39290701 497350.7

Using the function summary() we review the basic descriptive statistics


for monthly sales data.

> summary(dt2_a[,2:6])

salesQ_Month salesQ_Month_avgStore avgPrice_Month


Min. :2125292 Min. :26902 Min. :15.30
1st Qu.:2271975 1st Qu.:28759 1st Qu.:15.44
Median :2728502 Median :34705 Median :15.49
Mean :2691980 Mean :34271 Mean :15.55
3rd Qu.:2891140 3rd Qu.:36833 3rd Qu.:15.58
Max. :3510144 Max. :44432 Max. :15.97

revenue_Month revenue_Month_avgStore
Min. :28876607 Min. :365527
1st Qu.:30506308 1st Qu.:386156
Median :37234433 Median :474440
Mean :36755959 Mean :467950
3rd Qu.:40376265 3rd Qu.:517644
Max. :48769674 Max. :617338

Mental Math Problem A data analyst glanced through these data and
said that: There are no extreme outliers in these data. Can you validate
whether this statement is correct? Are there extreme outliers in these vari-
ables? Remember, the objective of mental math is to establish extreme
outliers. This means that you should round generously.
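One way to check the analyst's claim for one of the variables (a minimal sketch mirroring the IQR approach of section 3.1.1; lwM and uwM are names we introduce only for illustration) is:

> lwM <- quantile(dt2_a$salesQ_Month, .25) - 1.5*IQR(dt2_a$salesQ_Month)
> uwM <- quantile(dt2_a$salesQ_Month, .75) + 1.5*IQR(dt2_a$salesQ_Month)
> min(dt2_a$salesQ_Month) < lwM; max(dt2_a$salesQ_Month) > uwM

Both comparisons return FALSE for salesQ_Month, so there are no extreme outliers in that variable under this rule; the same check can be repeated for the remaining variables.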

3.4.2 Creating a Time Series Graph


We can use the package ggplot2 to generate a time series graph as follows:

1. Within the ggplot we specify the data set (dt2_a) and in the aes
argument we specify just the time (x=month).
2. We add (+) the function geom_point, which means that we want the
see the data points. Within the function and using aes we specify the
y-variable (y=salesQ_Month).


3. We add (+) the function geom_line, which means that we want a line
to connect the data points. Within the function and using aes we
specify the y-variable (y=salesQ_Month).
4. The x-axis (month) is numeric. To avoid breaks that may include
decimal points, we add (+) the function scale_x_continuous, and we
specify that we want the values to go from 1 to 12, in increments of 1.

> g4_salesQ_Month <- ggplot(dt2_a, aes(x=month)) +


geom_point(aes(y=salesQ_Month))+
geom_line(aes(y=salesQ_Month))+
scale_x_continuous(breaks=seq(1, 12, 1))
> g4_salesQ_Month

The resulting graph is shown in Figure 3.3. From this we can see that
Bibitor’s volume of units sold has a spike in July (7) and in December (12).
The volume follows an upward trend from January through July (1-7) and
seems to remain relatively stable at an average level from August to Novem-
ber (8-11).

Figure 3.3: Bibitor: Monthly Units Sold (Total)

Practice Problem: Time Series Graphs


1. Create the time series graph for Monthly Units Sold - Average Per Store
(salesQ_Month_avgStore).5
2. Create the time series graph for Monthly Revenue (revenue_Month).
3. Create the time series graph for Monthly Revenue - Average Per Store
(revenue_Month_avgStore).
5 The solution to these problems is on p. 53.


3.4.3 Monthly Sales Data per Store


Performing an analysis similar to the one shown in section 2.5, we generate
monthly sales data for each Bibitor store.
> dt2_b <- sqldf("
SELECT Store, month,
sum(SalesQuantity) AS store_salesQ_Month,
avg(SalesPrice) AS store_avgPrice_Month,
sum(SalesPrice*SalesQuantity) AS store_revenue_Month
FROM tSales GROUP BY Store, month")
> names(dt2_b)
[1] "Store" "month"
[3] "store_salesQ_Month" "store_avgPrice_Month"
[5] "store_revenue_Month"
> head(dt2_b)
Store month store_salesQ_Month store_avgPrice_Month
1 1 1 42052 14.67951
2 1 2 40698 14.90443
3 1 3 45879 15.03683
4 1 4 41882 15.06522
5 1 5 43997 15.14461
6 1 6 47132 15.06561

store_revenue_Month
1 496657.2
2 478458.7
3 549330.8
4 488541.6
5 523794.3
6 570393.5
> summary(dt2_b[,3:5])
store_salesQ_Month store_avgPrice_Month store_revenue_Month
Min. : 2245 Min. :12.36 Min. : 29427
1st Qu.: 15983 1st Qu.:14.04 1st Qu.: 190772
Median : 26191 Median :14.66 Median : 324252
Mean : 34256 Mean :14.89 Mean : 467732
3rd Qu.: 40543 3rd Qu.:15.42 3rd Qu.: 531786
Max. :202865 Max. :18.91 Max. :3214605


Mental Math Problem A data analyst glanced through these data and
said that: There seem to be extreme outliers in volume and revenue but not
in price. Can you validate whether this statement is correct?

3.4.4 Focus on Stores from Q2


As we found earlier (section 3.3.3), stores #50, #69 and #76 are outliers in quartile 2 (Q2). It would likely be useful to compare monthly data for these three stores. We can create a subset that has monthly sales for just these stores by specifying that observations in the data set should have a store number that belongs (%in%) to a specific set (c(76,69,50)).
> dt2_c <- dt2_b[dt2_b$Store %in% c(76,69,50), ]
> head(dt2_c,12)
Store month store_salesQ_Month store_avgPrice_Month
589 50 1 59756 16.81343
590 50 2 57630 16.79899
591 50 3 64191 16.57544
592 50 4 63425 17.04199
593 50 5 69654 17.04532
594 50 6 72119 16.87437
595 50 7 91195 16.40543
596 50 8 73097 16.27621
597 50 9 63318 16.38402
598 50 10 78793 16.49201
599 50 11 94747 17.05416
600 50 12 111514 18.03548

store_revenue_Month
589 908139.8
590 871620.8
591 947189.6
592 974966.6
593 1057386.3
594 1119092.8
595 1459538.6
596 1119384.4
597 934588.4
598 1184220.2
599 1601722.9
600 1781116.5


Create Time Series Graph


The objective here is to show three time series (one for each store) in the same graph. The approach is the same as the one in section 3.4.2, with one small difference. In the functions geom_point and geom_line, inside the aes, we add the following statement: color=Store. What this means is that we treat each one of the different values of the variable Store as a separate group and color-code these groups.

> g4_store_salesQ_Month <- ggplot(dt2_c, aes(x=month)) +


geom_point(aes(y=store_salesQ_Month, color=Store))+
geom_line(aes(y=store_salesQ_Month, color=Store))+
scale_x_continuous(breaks=seq(1, 12, 1))
> g4_store_salesQ_Month

The resulting Figure 3.4 shows that store #76 (blue) is experiencing a much bigger spike in the summer months (around July) than the other two stores.

Figure 3.4: Monthly Units Sold - Average Per Q2 Store

As the final step, we would like to create a graph that lets us compare the time series for these stores with the monthly volume for the average Bibitor store. These data have been captured in the variable salesQ_Month_avgStore in the data set dt2_a (see section 3.4.1).
Taking advantage of the incremental approach used to create a ggplot, we can do this by adding layers to the existing graph g4_store_salesQ_Month. More specifically, we add the functions geom_point and geom_line and specify the new data set (data=dt2_a) inside them; this way R knows the source of the y-variable. In addition, since we are adding only one series, we explicitly specify that ggplot should assign a color/label to these data with the following statement: color="avgStore".


> g4_store_salesQ_Month <- g4_store_salesQ_Month +


geom_point(data=dt2_a,
aes(y=salesQ_Month_avgStore, color="avgStore")) +
geom_line(data=dt2_a,
aes(y=salesQ_Month_avgStore, color="avgStore"))
> g4_store_salesQ_Month

The resulting graph is shown in Figure 3.5

Figure 3.5: Monthly Units Sold - Average Per Q2 Store vs. Average of All
Stores


3.5 Solutions to Selected Practice Problems


Solution to problems: 3.3.4 (p. 45).
ggplot(dt1_b, aes(storeSize, avgPrice_Store)) +
geom_boxplot(outlier.color = "blue")

ggplot(dt1_b,aes(storeSize, revenue_Store)) +
geom_boxplot(outlier.color = "green")

Solution to problems on Time Series Graphs: 3.4.2 (p. 48)


> g4_salesQ_Month_avgStore <- ggplot(dt2_a, aes(x=month)) +
geom_point(aes(y=salesQ_Month_avgStore))+
geom_line(aes(y=salesQ_Month_avgStore))+
scale_x_continuous(breaks=seq(1, 12, 1))
> g4_salesQ_Month_avgStore

Figure 3.6: Bibitor: Monthly Units Sold - Average Per Store

g4_revenue_Month <- ggplot(dt2_a, aes(x=month)) +


geom_point(aes(y=revenue_Month))+
geom_line(aes(y=revenue_Month))+
scale_x_continuous(breaks=seq(1, 12, 1))
g4_revenue_Month
g4_revenue_Month_avgStore <- ggplot(dt2_a, aes(x=month)) +
geom_point(aes(y=revenue_Month_avgStore))+
geom_line(aes(y=revenue_Month_avgStore))+
scale_x_continuous(breaks=seq(1, 12, 1))
g4_revenue_Month_avgStore

Part II

Basic ADA

Chapter 4

Inventory, Cost of Sales & Sales

Learning Objectives
By the end of this chapter trainees should have achieved the following objec-
tives:

1. Build on matters learned in chapters 1 - 3, to ...


(a) Load/import large data sets that were provided by the client in
.csv format.
(b) Use summary statistics to understand the distribution of target-
ed/key variables.
(c) Develop SQL queries to generate aggregate variables, e.g., sum of
units sold per product.
2. Develop SQL queries for merging data from different data sets and
create new variables.
3. Apply these to an ADA used to obtain an understanding of a company’s
business, and assess risks of material misstatement in inventories.

4.1 Information Regarding Audited Company


This chapter focuses on the audit of the financial statements of Bibitor, LLC, for its year ended June 30, 2016. Some key attributes of Bibitor's business and the environment in which it operates are as follows:

• Bibitor is a retail liquor company. It does not sell to wholesalers nor


does it have individually significant customers to whom sales discounts
are offered.


• All sales are cash or credit card sales. The company does not have
trade accounts receivable.
• The company’s headquarters are in the State of Lincoln (a fictional
state used for illustrative purposes) and it has 79 stores located in var-
ious cities throughout the state. In common with other states, Lincoln
has extensive and rigorously enforced laws and regulations regarding
the sale of liquor. For example, Bibitor is not permitted to ship inven-
tory out of state.
• The company has been selling spirits and wine products for over 50
years. The business has been successful to date. In the audit of the
preceding year, the auditor did not identify any material uncertainty
regarding Bibitor’s ability to continue as a going concern.
• During the fiscal year under audit, the company had over 10,000 brands
available for sale. Most of the products are low to medium-priced and
intended to appeal to a wide range of consumers (all of whom are
required to be over the legal drinking age). Bibitor also stocks limited
quantities of high-priced spirits and wines. Some of these brands are
individually priced at thousands of dollars.

Bibitor’s management states that the company’s financial statements are


meant to be prepared in accordance with US generally accepted accounting
principles (GAAP) as promulgated by the Financial Accounting Standards
Board. Management’s accounting policy, consistent with GAAP, is to record
its inventory of spirits and wine at the lower of weighted average cost and
net realizable value.
This chapter illustrates aspects of how basic ADA can be applied to audit-
ing certain aspects of a company’s sales prices, purchase prices for inventory,
and inventory on hand.

4.2 ADA Objectives


This ADA is being performed to assess the risks of material misstatement
that would result from a failure by Bibitor to record inventories (i.e., spirits
and wine) at the lower of its weighted average cost and net realizable value,
in accordance with Bibitor’s accounting policy and GAAP.
Bibitor estimates net realizable value as the amount that inventories are
expected to be sold at, less the estimated costs necessary to make the sale.
Inventories are written down to net realizable value when management esti-
mates that the cost of inventories is not recoverable.
Specifically, the objectives of this ADA are to:


1. Provide an understanding of the prices Bibitor charged to its customers


for its products during 2016.
2. Identify any notable items (for example, unusually high or low prices)
that may indicate the existence of pricing errors which have audit im-
plications.
3. Identify products, if any, for which selling prices (an indicator of net
realizable value) are less than cost indicating a risk of a material mis-
statement related to a failure to write down inventories to net realizable
value.

4.3 Data Understanding


4.3.1 Obtaining an Understanding of Sales Prices
Import and Review Sales Data

We load and review the sales data. For a review of the commands used and
the interpretation of results see section 2.3.4 (p. 25).1

> library(data.table)
> tSales <- fread("tSales.csv")
> names(tSales)
> names(tSales)[1] <- "salesID"
> str(tSales)

Classes 'data.table' and 'data.frame':


12697914 obs. of 7 variables:
$ salesID : chr "1" "2" "3" "4" ...
$ Store : int 1 1 1 1 1 1 1 1 1 1 ...
$ Brand : int 10021 10051 10058 10058
10058 10058 10058 10058 10058 10058 ...
$ SalesQuantity: int 1 1 1 1 1 2 2 2 2 2 ...
$ SalesPrice : num 13 60 14 14 14 ...
$ SalesDate : chr "2015-07-12" "2015-07-28"
"2015-07-02" "2015-07-04" ...
$ ExciseTax : num 0.11 0.11 0.11 0.11 0.11
0.22 0.22 0.22 0.22 0.22 ...
1 The complete R Script used for the creation of this chapter is available from the following URL: https://goo.gl/NTEgm8


Create Aggregate Sales Data


The next step is to decide how to organize the large amount of data obtained
in a way that will provide information that is useful to the audit. For this
ADA, information regarding range of prices charged for each brand (product)
(i.e., the maximum and minimum prices charged and the range between those
prices) would be helpful. It would also be useful to know the average price
charged for each brand, so sales quantity data will be used as well as prices.
Therefore, we use R to aggregate sales records/transactions at the product
(brand) level. For each one of the products we will calculate the following
variables:

1. sum of units sold


2. average price
3. minimum price
4. maximum price
5. range of prices at which each product was sold

> library(sqldf)
> tSalesByProduct <- sqldf("
SELECT Brand, sum(SalesQuantity) AS sumQSales,
avg(SalesPrice) AS avgPrice, max(SalesPrice) AS maxPrice,
min(SalesPrice) AS minPrice,
(max(SalesPrice)-min(SalesPrice)) AS rangePrice
FROM tSales GROUP BY Brand")

Review Aggregated Sales Data


Having aggregated the sales data, it is useful to review the results of the
aggregation to confirm that we have organized it in a way that is useful. If
not, we can redo the aggregation in a different way. To review the aggrega-
tion we use the head command in R to present the first few lines of our data
aggregation. As shown below, we have chosen to display the first 5 items
which relate to the products with the lowest 5 brand identifier numbers (see
column 2).

> names(tSalesByProduct)

[1] "Brand" "sumQSales" "avgPrice" "maxPrice"


[5] "minPrice" "rangePrice"

> head(tSalesByProduct,5)


Brand sumQSales avgPrice maxPrice minPrice rangePrice


1 58 3163 12.99000 12.99 12.99 0
2 60 1931 10.62240 10.99 9.99 1
3 61 281 13.99000 13.99 13.99 0
4 62 2997 38.29359 41.99 36.99 5
5 63 2498 40.28630 43.99 38.99 5

Summary Statistics for Aggregated Sales Data


Having developed aggregate data by product in the preceding step, R can
be used to generate statistics from this data to provide a basis for obtaining
insights regarding the population of Bibitor’s sales prices for the year. The
statistics generated in this example relate to the data fields 2 to 5 (i.e. brand
number is excluded). That is because brand number is not relevant to this
part of the analysis. However, as we will see later, we can and will drill
down to get information at the brand level when warranted.

> summary(tSalesByProduct$avgPrice)

Min. 1st Qu. Median Mean 3rd Qu. Max.


0.49 10.49 15.68 32.64 28.23 4999.99

> summary(tSalesByProduct[,2:6])

sumQSales avgPrice maxPrice


Min. : 1 Min. : 0.49 Min. : 0.49
1st Qu.: 32 1st Qu.: 10.49 1st Qu.: 11.99
Median : 254 Median : 15.68 Median : 16.99
Mean : 3084 Mean : 32.64 Mean : 35.32
3rd Qu.: 1956 3rd Qu.: 28.23 3rd Qu.: 29.99
Max. :319248 Max. :4999.99 Max. :13999.90

minPrice rangePrice
Min. : 0.00 Min. : 0.000
1st Qu.: 9.99 1st Qu.: 0.000
Median : 14.99 Median : 0.000
Mean : 31.78 Mean : 3.537
3rd Qu.: 27.95 3rd Qu.: 3.000
Max. :4999.99 Max. :13967.910

The above statistics would be reviewed in the context of the auditor’s under-
standing of the company’s operations obtained in previous years’ audits. For example, in this case, the auditor likely would not be surprised at the wide
range between maximum and minimum prices. This is because Bibitor’s
products have traditionally included very small sample bottles, costing pen-
nies as well as individual bottles of very expensive whiskies and wines. If
the statistics did not show this wide variation, then this would not be con-
sistent with the auditor’s understanding of the company’s operations and
would warrant further attention. In this case, the existence of “0” sales
prices would warrant investigation to determine the reason why zero prices
exist.
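As a possible starting point for that investigation (a minimal sketch based on the aggregated data already created; zeroPrice is a name we introduce only for illustration), we can list the brands whose minimum recorded selling price is at, or very close to, zero:

> zeroPrice <- tSalesByProduct[tSalesByProduct$minPrice < 0.01, ]
> nrow(zeroPrice)
> head(zeroPrice, 10)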

Outliers in Price Range


The summary statistics show some extreme outliers. Let's take a closer look at the price range.2

> tSalesByProduct[tSalesByProduct$rangePrice>100,]

Brand sumQSales avgPrice maxPrice minPrice rangePrice


153 368 37 1368.41105 1559.99 1299.99 260.00
180 423 4 4628.49000 4696.99 4559.99 137.00
1181 2696 3587 65.70574 13999.90 31.99 13967.91
9304 35989 4 1482.49000 1664.99 1299.99 365.00
9315 36063 14 1233.32333 1499.99 1099.99 400.00

What might be the implications of wide price ranges? A wide price range, particularly a range as wide as that of brand 2696, may be an indication of a pricing error. If the error is the result of a control deviation, there could be a significant number of other pricing errors.

2 A price range of greater than $100 has been used to illustrate the basics of performing this step. In practice, the determination of an appropriate price range likely would be more complex. The objective would be to set a range that would identify outliers likely to be significant to the audit (i.e., warrant further auditor attention). For example, if there was, say, a 20% fluctuation in the price ranges of many lower priced brands, and this was not consistent with what the auditor expected, given his or her knowledge of the company's business, this would likely warrant further auditor attention. Identification of outliers using this more complex approach is beyond what is meant to be illustrated here.

4.3.2 Obtaining an Understanding of Inventory Costs


Import and Review Inventory Data
> tEndInv <- fread(".../BBTR_20160630_rdb/tEndinv.csv")
> names(tEndInv)

[1] "V1" "Store" "Brand" "onHand" "PPrice" "endDate"

> names(tEndInv)[1] <- "endInvID"


> str(tEndInv); head(tEndInv)

Classes 'data.table' and 'data.frame':


214211 obs. of 6 variables:
$ endInvID: chr "1" "2" "3" "4" ...
$ Store : int 1 1 1 1 1 1 1 1 1 1 ...
$ Brand : int 58 62 63 72 75 77 79 115 126 165 ...
$ onHand : int 12 11 3 6 9 58 6 16 19 16 ...
$ PPrice : num 13 37 39 35 15 ...
$ endDate : chr "2016-06-30" "2016-06-30" "2016-06-30" ...

endInvID Store Brand onHand PPrice endDate


1: 1 1 58 12 9.28 2016-06-30
2: 2 1 62 11 28.67 2016-06-30
3: 3 1 63 3 30.46 2016-06-30
4: 4 1 72 6 26.11 2016-06-30
5: 5 1 75 9 10.94 2016-06-30
6: 6 1 77 58 10.39 2016-06-30

4.3.3 Create Aggregate Inventory Data


To perform our analysis we need to aggregate inventory data at the product level. In SQL this means that we need to group the observations by product (GROUP BY Brand). If we assume that a given product has n store-level inventory records, then for each one of the products we will need to calculate the following variables:

1. End inventory cost (endInvCost) is calculated as the sum of acquiring cost per unit (PPrice) times quantity on hand (onHand), summed over the inventory records for the product:

   endInvCost = Σ_{i=1}^{n} (onHand_i × PPrice_i)

In SQL we specify this as follows:


• sum(onHand*PPrice) AS endInvCost
2. Average acquiring cost (avgPPrice) is the average of acquiring cost
(PPrice) per product.
• avg(PPrice) AS avgPPrice


3. Maximum acquiring cost (maxPPrice) is the maximum of acquiring


cost (PPrice) per product.
• max(PPrice) AS maxPPrice
4. Minimum acquiring cost (minPPrice) is the minimum of acquiring cost
(PPrice) per product.
• min(PPrice) AS minPPrice
5. Range of acquiring costs (rangePPrice) at which each product was
purchased.

rangePPrice = max(PPrice) − min(PPrice)

• (max(PPrice) - min(PPrice)) AS rangePPrice


6. Total quantity on hand (totalQoH) is the sum of quantity on hand (onHand) per product:

   totalQoH = Σ_{i=1}^{n} onHand_i

• sum(onHand) AS totalQoH

With the following R script (SQL query) we create the new variables and
save them in a new data set named tEndInvByProduct.

> tEndInvByProduct <- sqldf("


SELECT Brand, sum(onHand*PPrice) AS endInvCost,
avg(PPrice) AS avgPPrice,
max(PPrice) AS maxPPrice, min(PPrice) AS minPPrice,
(max(PPrice)-min(PPrice)) AS rangePPrice,
sum(onHand) AS totalQoH
FROM tEndInv GROUP BY Brand")
> str(tEndInvByProduct)

We review the structure (str) of the variables in the tEndInvByProduct data
set. As we can see, the new data set has 8413 observations (corresponding to
unique products in the firm’s inventory) and seven variables (the Brand/code
for each product and the six aggregate variables). With the exception of two
variables (Brand and totalQoH) which are integers, all remaining variables
are numeric.

> str(tEndInvByProduct)


'data.frame': 8413 obs. of 7 variables:


$ Brand : int 58 60 61 62 63 70 72 75 77 79 ...
$ endInvCost : num 3424 229 127 13360 12214 ...
$ avgPPrice : num 9.28 7.4 10.6 28.67 30.46 ...
$ maxPPrice : num 9.28 7.4 10.6 28.67 30.46 ...
$ minPPrice : num 9.28 7.4 10.6 28.67 30.46 ...
$ rangePPrice: num 0 0 0 0 0 0 0 0 0 0 ...
$ totalQoH : int 369 31 12 466 401 5 147 9 1685 1014 ...

Review the top (head) and bottom (tail) observations of the tEndInvByProduct
data set.

> head(tEndInvByProduct); tail(tEndInvByProduct)

Brand endInvCost avgPPrice maxPPrice minPPrice rangePPrice


1 58 3424.32 9.28 9.28 9.28 0
2 60 229.40 7.40 7.40 7.40 0
3 61 127.20 10.60 10.60 10.60 0
4 62 13360.22 28.67 28.67 28.67 0
5 63 12214.46 30.46 30.46 30.46 0
6 70 69.80 13.96 13.96 13.96 0

totalQoH
1 369
2 31
3 12
4 466
5 401
6 5

Brand endInvCost avgPPrice maxPPrice minPPrice rangePPrice


8408 90088 4530.54 92.46 92.46 92.46 0
8409 90089 13168.48 77.92 77.92 77.92 0
8410 90090 20172.15 448.27 448.27 448.27 0
8411 90604 3685.74 78.42 78.42 78.42 0
8412 90609 5032.00 17.00 17.00 17.00 0
8413 90631 4739.28 12.74 12.74 12.74 0

totalQoH
8408 49
8409 169


8410 45
8411 47
8412 296
8413 372

4.3.4 Review Summary Statistics - Aggregate Data


For the set of these six aggregate variables (2nd through 7th in the data set
tEndInvByProduct), we calculate summary descriptive statistics as follows:

> summary(tEndInvByProduct[,2:7])

endInvCost avgPPrice maxPPrice


Min. : 0.0 Min. : 0.36 Min. : 0.36
1st Qu.: 355.2 1st Qu.: 7.18 1st Qu.: 7.18
Median : 2077.0 Median : 10.81 Median : 10.81
Mean : 6211.5 Mean : 28.43 Mean : 28.43
3rd Qu.: 6457.8 3rd Qu.: 20.60 3rd Qu.: 20.60
Max. :370380.7 Max. :11111.03 Max. :11111.03

minPPrice rangePPrice totalQoH


Min. : 0.36 Min. :0 Min. : 0.0
1st Qu.: 7.18 1st Qu.:0 1st Qu.: 22.0
Median : 10.81 Median :0 Median : 160.0
Mean : 28.43 Mean :0 Mean : 554.3
3rd Qu.: 20.60 3rd Qu.:0 3rd Qu.: 550.0
Max. :11111.03 Max. :0 Max. :18396.0

These statistics indicate matters that would require further attention by
the auditor. In particular, all of the columns showing purchase prices (aver-
age, maximum and minimum) have the same values and, therefore, the price
ranges are zero. The results may indicate, for example, that the firm paid
fixed prices throughout the year for its products, or perhaps made only one
purchase of each brand type during the year. Also, one or more products
have a zero ending inventory cost (as indicated by the minimum of 0.0 in the endInvCost
column). The auditor would assess whether these results make sense based
on his or her knowledge of the company and its industry, or whether they
may indicate, for example, a possible significant deficiency in internal control
over recording inventory costs.
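
As a quick follow-up (a supplementary check we have added, not part of the
chapter's original script), we can list the products whose ending inventory
cost is zero so that they can be investigated further. Reviewing their totalQoH
would also show whether the zeros reflect zero quantities on hand or genuinely
zero recorded costs.

> # Products whose ending inventory cost is zero (possible data or control issue)
> zeroCostBrands <- tEndInvByProduct[tEndInvByProduct$endInvCost==0,]
> nrow(zeroCostBrands); head(zeroCostBrands)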


4.4 Modeling
Given the third objective of our ADA stated in section 4.2, we need to focus
on products whose average cost (acquisition cost) is higher than their average selling
price.3 This means that we need to compare the average selling price with the average
acquisition cost for each product available in the end-of-year inventory. To
make this comparison we will need to combine information from two separate
data sets (tables).

First: we need the average selling price from the table tSalesByProduct.
The following command shows the name of the variable for selling price in
tSalesByProduct is avgPrice.

> names(tSalesByProduct)

[1] "Brand" "sumQSales" "avgPrice" "maxPrice"


[5] "minPrice" "rangePrice"

Second: we need the average acquisition cost from the table tEndInvByProduct.
The name of the variable for acquisition cost is avgPPrice.

> names(tEndInvByProduct)

[1] "Brand" "endInvCost" "avgPPrice" "maxPPrice"


[5] "minPPrice" "rangePPrice" "totalQoH"

Third: we need to create a new data set that has all observations from the
inventory table (tEndInvByProduct) but only the matching records from
the sales table (tSalesByProduct). In SQL this type of join is called a LEFT
JOIN.

4.4.1 Create Comparison Data


To identify whether there are instances when average cost exceeds selling
price, we generate a new data set (dt_comp) that contains the new price vs
cost comparison variable (deltaCostPrice). The logic behind the creation
of the variable deltaCostPrice is as follows: If the average cost is less than
3 We are assuming that the use of average selling price to estimate net realizable value
is appropriate for Bibitor. In other circumstances, most recent selling prices, including
prices charged subsequent to the fiscal year end, might be appropriate for this purpose.


the average price, then the variable deltaCostPrice = 0; otherwise, deltaCostPrice
is equal to the average acquisition cost minus the average selling price.
More specifically, we will use the following CASE ... END structure to
capture the if-else statement.

CASE
WHEN avgPPrice<=avgPrice THEN 0
ELSE avgPPrice-avgPrice
END
The query below can be divided into the following steps:
1. We perform a LEFT JOIN. The table tEndInvByProduct, end of year
inventory data, is positioned on left with an alias of b. The table
tSalesByProduct, sales data, is positioned on right with an alias of a.
2. The two tables are linked based ON the common field a.Brand=b.Brand.
3. We select the fields b.Brand, avgPPrice, avgPrice, sumQSales, and
totalQoH.
4. We use the CASE ... END to create the target variable deltaCostPrice.
5. Since the summary statistics have shown that there are some products
that have an average selling price of zero, we limit our data to products
with a positive price by specifying WHERE avgPrice>0.

SQL Query
> dt_comp <- sqldf("SELECT b.Brand, avgPPrice, avgPrice,
sumQSales, totalQoH,
CASE
WHEN avgPPrice<=avgPrice THEN 0
ELSE avgPPrice-avgPrice
END AS deltaCostPrice
FROM tEndInvByProduct AS b LEFT JOIN tSalesByProduct AS a
ON a.Brand=b.Brand
WHERE avgPrice>0")

4.4.2 Review Comparison Data


Using the function nrow, which returns the number of rows in a data set, we
can see that the new data set has 8,181 observations.

> nrow(dt_comp)

[1] 8181


An initial glance at the top (head) and bottom (tail) observations of
the data set does not show any product having a selling price that is less
than its average cost.

> head(dt_comp); tail(dt_comp)

Brand avgPPrice avgPrice sumQSales totalQoH deltaCostPrice


1 58 9.28 12.99000 3163 369 0
2 60 7.40 10.62240 1931 31 0
3 61 10.60 13.99000 281 12 0
4 62 28.67 38.29359 2997 466 0
5 63 30.46 40.28630 2498 401 0
6 70 13.96 24.24000 8 5 0

Brand avgPPrice avgPrice sumQSales totalQoH deltaCostPrice


8176 90088 92.46 134.99 10 49 0
8177 90089 77.92 119.99 142 169 0
8178 90090 448.27 649.99 6 45 0
8179 90604 78.42 119.99 54 47 0
8180 90609 17.00 24.99 24 296 0
8181 90631 12.74 18.99 156 372 0

To explore whether any of the 8181 observations of the data set are associated
with products where the average inventory cost (acquisition cost) is higher
than the average selling price, we look at the descriptive statistics.

4.4.3 Summary Statistics


> options(scipen = 99)
> summary(dt_comp$deltaCostPrice)

Min. 1st Qu. Median Mean 3rd Qu. Max.


0.000000 0.000000 0.000000 0.002085 0.000000 6.245422

The summary statistics show that deltaCostPrice is zero for the vast majority of products; even the largest difference is only about $6.25 per unit.
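
To quantify this before choosing a cut-off (a supplementary check, not part of
the original query), we can count how many products show any positive
cost–price difference and what share of the comparison data set they represent:

> # Number and share of products where average cost exceeds average selling price
> sum(dt_comp$deltaCostPrice>0); mean(dt_comp$deltaCostPrice>0)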

4.4.4 Focus on Relatively Large Discrepancies


We focus on discrepancies which are more than 25 cents per unit. Using
the function nrow we can see that there are 10 observations that meet this
condition.


> dt_compFlag <- dt_comp[dt_comp$deltaCostPrice>0.25,]


> nrow(dt_compFlag)

[1] 10

We review all ten observations of the data set below.

> dt_compFlag

Brand avgPPrice avgPrice sumQSales totalQoH deltaCostPrice


953 2729 19.99 19.485327 515 471 0.5046729
1279 3375 12.59 11.377048 897 508 1.2129518
2537 7680 19.68 19.060588 397 366 0.6194118
4378 19138 6.71 3.330000 4 2 3.3800000
4861 20975 7.48 6.905842 1214 123 0.5741584
4872 21003 9.20 8.007208 2753 546 1.1927916
5242 22327 14.83 14.523333 60 9 0.3066667
6432 25588 20.26 14.014578 1649 1578 6.2454224
6570 25937 10.34 9.347778 536 475 0.9922222
7076 33331 10.73 9.056194 3301 336 1.6738061

4.5 Evaluation and Communication


The results of the ADA performed indicated a potentially significant issue re-
garding the accuracy of recording sales prices. For example, the sales price for
product 2696 appears to have varied between $31.99 and $13,999.90 (see the
summary statistics for outliers in section 4.3.1). The auditor would discuss
this matter with management and perform investigative work to determine
the root cause of the problem which might, for example, be a significant de-
ficiency (material weakness) in internal control. There was also the issue of
a lack of variation in purchase prices. Again, reasons for this would be ob-
tained, supported by audit evidence. The actions taken by the auditor would
depend on the results obtained as a result of more in-depth work. If there are
significant deficiencies (material weaknesses) in internal control, they would
be discussed with management and the audit committee. If there are per-
vasive errors, management would have to correct the records to make them
auditable. If the misstatements are due to fraud, then the auditor would eval-
uate the implications for the audit in accordance with the auditing standard
on fraud.


4.6 Documentation
An example audit working paper showing documentation for this ADA can
be accessed from the following url:
https://docs.google.com/document/d/1Xeq6ignMMxMgVT6Dg-6IINL8aLOHaWHMKvZtOAHaKpQ/edit?usp=sharing.


4.7 Practice Problems


4.7.1 Most Expensive Products
Q: List the 15 most expensive items sold
We import the sales data and review the variable names.
> library(data.table)
> tSales <- fread("tSales.csv")
> names(tSales)
[1] "V1" "Store" "Brand"
[4] "SalesQuantity" "SalesPrice" "SalesDate"
[7] "ExciseTax"
We know that if we want to see the first 15 observations in the sales
data, we can use the function head(tSales, 15). If the data has been
ordered in descending order of SalesPrice, the head() will return the most
expensive products. We achieve this ordering using the function order().
More specifically, we can write order(-tSales$SalesPrice). The negative
sign in front of the variable indicates descending order. The default ascending
order is achieved without the negative sign. The following script will generate
the list of the 15 sales transactions during the year that reflect the highest
prices charged by Bibitor for its products.
> head(tSales[order(-tSales$SalesPrice),2:7],15)
Store Brand SalesQuantity SalesPrice SalesDate ExciseTax
1: 10 2696 1 13999.90 2015-10-13 1.84
2: 22 2696 1 13999.90 2015-10-13 1.84
3: 39 2696 1 13999.90 2015-10-13 1.84
4: 52 2696 1 13999.90 2015-10-13 1.84
5: 70 2696 1 13999.90 2015-10-13 1.84
6: 73 2696 1 13999.90 2015-10-13 1.84
7: 66 1991 1 4999.99 2015-07-21 0.79
8: 67 1991 1 4999.99 2015-07-21 0.79
9: 76 1991 1 4999.99 2015-07-23 0.79
10: 66 1991 1 4999.99 2015-12-02 0.79
11: 76 1991 1 4999.99 2015-12-08 0.79
12: 67 1991 1 4999.99 2016-01-27 0.79
13: 34 1991 1 4999.99 2016-02-07 0.79
14: 66 423 1 4696.99 2016-06-12 0.79
15: 69 423 1 4696.99 2016-06-13 0.79


Q: List the 15 most expensive items purchased


We import and review the purchase data.

> tPurchases <- fread("tPurchases.csv")


> names(tPurchases)

[1] "V1" "Store" "Brand"


[4] "PONumber" "ReceivingDate" "PurchasePrice"
[7] "Quantity"

Order data by purchase price (descending) and review the set of 15 observa-
tions.

> head(tPurchases[order(-tPurchases$PurchasePrice),2:7],15)

Store Brand PONumber ReceivingDate PurchasePrice Quantity


1: 34 2693 7349 2015-11-06 11111.03 1
2: 34 3949 8481 2016-01-22 5681.81 1
3: 66 3949 8481 2016-01-22 5681.81 1
4: 67 3949 8481 2016-01-22 5681.81 1
5: 34 2367 8661 2016-02-01 4264.70 1
6: 34 1991 5965 2015-07-27 3787.87 1
7: 66 1991 6172 2015-08-10 3787.87 1
8: 67 1991 6172 2015-08-10 3787.87 1
9: 76 1991 6825 2015-10-02 3787.87 1
10: 66 4423 7461 2015-11-10 3649.63 1
11: 34 4423 7461 2015-11-10 3649.63 1
12: 67 4423 7461 2015-11-10 3649.63 1
13: 73 423 6586 2015-09-11 3352.93 1
14: 34 423 6621 2015-09-18 3352.93 1
15: 66 423 10590 2016-06-13 3352.93 1

Notice that the most expensive product sold (Brand=2696) is not included
in the list of the 15 most expensive purchased items!

Q: List the 15 most expensive items in beginning inventory


Import and review variables in beginning inventory.

> tBegInv <- fread("tBegInv.csv")


> names(tBegInv)


[1] "V1" "Store" "Brand" "onHand" "PPrice"


[6] "startDate"

Order data in terms of inventory cost (PPrice, descending) and review the
first 15 observations.

> head(tBegInv[order(-tBegInv$PPrice),2:6],15)

Store Brand onHand PPrice startDate


1: 66 2367 1 4264.70 2015-07-01
2: 67 2367 1 4264.70 2015-07-01
3: 66 1991 1 3787.87 2015-07-01
4: 67 1991 1 3787.87 2015-07-01
5: 76 1991 1 3787.87 2015-07-01
6: 34 423 1 3352.93 2015-07-01
7: 38 423 1 3352.93 2015-07-01
8: 66 423 1 3352.93 2015-07-01
9: 67 423 1 3352.93 2015-07-01
10: 69 423 1 3352.93 2015-07-01
11: 73 423 1 3352.93 2015-07-01
12: 76 423 1 3352.93 2015-07-01
13: 66 2996 1 3149.60 2015-07-01
14: 67 2996 1 3149.60 2015-07-01
15: 34 16191 1 2816.43 2015-07-01

Notice that the most expensive product sold (Brand=2696) is not included
in the list of the 15 most expensive items in beginning inventory!

Q: List the 15 most expensive items in ending inventory


Import and review variables in ending inventory.

> tEndInv <- fread("tEndinv.csv")


> names(tEndInv)

[1] "V1" "Store" "Brand" "onHand" "PPrice" "endDate"

Order data in terms of inventory cost (PPrice, descending) and review the
first 15 observations.

> head(tEndInv[order(-tEndInv$PPrice),2:6],15)


Store Brand onHand PPrice endDate


1: 34 2693 1 11111.03 2016-06-30
2: 34 3949 1 5681.81 2016-06-30
3: 66 3949 1 5681.81 2016-06-30
4: 67 3949 1 5681.81 2016-06-30
5: 34 2367 1 4264.70 2016-06-30
6: 66 2367 1 4264.70 2016-06-30
7: 67 2367 1 4264.70 2016-06-30
8: 34 4423 1 3649.63 2016-06-30
9: 66 4423 1 3649.63 2016-06-30
10: 67 4423 1 3649.63 2016-06-30
11: 34 423 1 3352.93 2016-06-30
12: 38 423 1 3352.93 2016-06-30
13: 66 423 1 3352.93 2016-06-30
14: 67 423 1 3352.93 2016-06-30
15: 73 423 1 3352.93 2016-06-30

Notice that the most expensive product sold (Brand=2696) is not included in
the list of the 15 most expensive items in ending inventory!

Q: Find the most expensive item sold


Given that the most expensive item sold does not appear in the
lists of the most expensive purchased or inventory items, we should take a
closer look at this product. We can limit the sales data to just the most
expensive products by specifying that we want observations where the sales
price equals the maximum sales price, as follows:
> tSales[tSales$SalesPrice==max(tSales$SalesPrice),]

V1 Store Brand SalesQuantity SalesPrice SalesDate


1: 3383390 10 2696 1 13999.9 2015-10-13
2: 3540057 22 2696 1 13999.9 2015-10-13
3: 3767407 39 2696 1 13999.9 2015-10-13
4: 3935029 52 2696 1 13999.9 2015-10-13
5: 4262893 70 2696 1 13999.9 2015-10-13
6: 4315075 73 2696 1 13999.9 2015-10-13

ExciseTax
1: 1.84
2: 1.84
3: 1.84


4: 1.84
5: 1.84
6: 1.84

Notice that the same product was sold on the same date in six different stores.
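
Before turning to purchases, it can also be helpful to tabulate how often each
selling price was recorded for this brand (a supplementary check we have added);
this shows whether the $13,999.90 entries are isolated or widespread relative to
the brand's normal selling price.

> # Frequency of each selling price recorded for brand 2696
> table(tSales[tSales$Brand==2696,]$SalesPrice)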

Q: Purchase prices of this product

We specify the subset that has only purchases of this product.

> tPurchases[tPurchases$Brand==2696,2:7]

Store Brand PONumber ReceivingDate PurchasePrice Quantity


1: 15 2696 5613 2015-07-03 25.19 6
2: 6 2696 5613 2015-07-04 25.19 6
3: 38 2696 5613 2015-07-02 25.19 6
4: 67 2696 5613 2015-07-02 25.19 6
5: 68 2696 5613 2015-07-02 25.19 6
---
646: 66 2696 10820 2016-06-28 25.19 6
647: 67 2696 10820 2016-06-28 25.19 6
648: 21 2696 10820 2016-06-28 25.19 5
649: 69 2696 10820 2016-06-30 25.19 6
650: 38 2696 10820 2016-06-30 25.19 6

The data set shows that there were 650 purchase line items, and they all seem
to be priced at just $25.19. To get a better feeling for the distribution
of the purchase prices we run summary statistics for just these observations.
Notice that the variable is specified using the $ sign after the closing
square bracket.

> summary(tPurchases[tPurchases$Brand==2696,]$PurchasePrice)

Min. 1st Qu. Median Mean 3rd Qu. Max.


25.19 25.19 25.19 25.19 25.19 25.19

Summary statistics show that the purchase price is constant at $25.19. All
descriptive statistics are the same.


Q: Beginning inventory cost for this product


We perform the same analysis with beginning inventory.

> tBegInv[tBegInv$Brand==2696,2:6]

Store Brand onHand PPrice startDate


1: 1 2696 4 25.19 2015-07-01
2: 2 2696 9 25.19 2015-07-01
3: 4 2696 7 25.19 2015-07-01
...
Lines for other stores have been deleted for brevity.
While quantity on hand varies from store to store,
the price is 25.19 across all stores.
...
73: 76 2696 19 25.19 2015-07-01
74: 77 2696 8 25.19 2015-07-01
75: 78 2696 6 25.19 2015-07-01
Store Brand onHand PPrice startDate

> summary(tBegInv[tBegInv$Brand==2696,]$PPrice)

Min. 1st Qu. Median Mean 3rd Qu. Max.


25.19 25.19 25.19 25.19 25.19 25.19

Summary statistics show that the inventory cost for product 2696 is constant
at $25.19 across all stores.

Q: End inventory cost for this product


We perform the same analysis with ending inventory data.

> tEndInv[tEndInv$Brand==2696,]

V1 Store Brand onHand PPrice endDate


1: 462 1 2696 2 25.19 2016-06-30
2: 3446 2 2696 8 25.19 2016-06-30
3: 6605 4 2696 5 25.19 2016-06-30
...
Lines for other stores have been deleted for brevity.
While quantity on hand varies from store to store,
the price is 25.19 across all stores.
...


73: 202205 76 2696 17 25.19 2016-06-30


74: 207172 77 2696 7 25.19 2016-06-30
75: 209390 78 2696 5 25.19 2016-06-30
76: 210929 79 2696 8 25.19 2016-06-30
V1 Store Brand onHand PPrice endDate

> summary(tEndInv[tEndInv$Brand==2696,2:6]$PPrice)

Min. 1st Qu. Median Mean 3rd Qu. Max.


25.19 25.19 25.19 25.19 25.19 25.19

Summary statistics show that the inventory cost for product 2696 is constant
at $25.19 across all stores.

Communication Practice Problem


What would you communicate to management regarding the ADA results
for product 2696, taking into account that the results indicate either
error or fraud? When would you communicate with management? Would
you communicate with the audit committee?

Chapter 5

Payroll

Learning Objectives
By the end of this chapter trainees should have learned how to design, per-
form, evaluate, and communicate the results of ADA to obtain an under-
standing of an entity’s payroll expenses and assess risks of material misstate-
ment of those expenses. This chapter builds on matters noted in chapters 1
– 3 by discussing how to use R, including SQL queries, to:

1. Load/import and review payroll datasets provided by the client in .csv


format.
2. Reformat certain data when necessary.
3. Combine information in various tables to enable performance of the
ADA.
4. Verify that the data relate to the appropriate period.
5. Develop summary statistics for both salaried and wage employees and
assess whether data, and overall payroll costs, fall within ranges ex-
pected by the auditor.
6. Create time series graphs to visualize trends in payroll costs.
7. Determine if there are duplicate payroll payments to the same employee
in a given pay period.

5.1 Information Regarding Bibitor’s Payroll


This chapter focuses on the audit of Bibitor’s payroll expense in its annual
financial statements. The following is brief background information related
to the company’s payroll.


1. Payroll costs are shown in Bibitor’s preliminary annual financial statements as “Personnel Services” expenses of $23.8 million.
2. Bibitor expenses payroll costs when incurred. The company buys all
the products it sells and therefore, for example, no payroll costs are
included in cost of inventory.
3. Bibitor has approximately 150 salaried employees and 400 employees
paid on an hourly basis. Rates per hour and hours worked vary based,
for example, on the type of position, seniority and whether the staff
member is full time or part time.
4. Employees are paid bi-weekly.
5. Employees are located at the head office and in each of Bibitor’s 79
stores. Each store manager is responsible for managing the staff in his
or her store.
6. See Bibitor Payroll Process Manual for approval processes regarding
hiring, terminations etc.1
7. See the Bibitor Payroll Flowcharts describing Bibitor’s Digital Time
Card (DTC) and DTC Approval Payment Processes.
8. The company uses ADP, a third party payroll service, to process the
payroll and issue checks. The company has implemented controls over
the submission and receipt of information from ADP as set out in the
Payroll Process Manual.

5.2 ADA Objectives


The ADA in this chapter are being used in assessing risks of material mis-
statement in Bibitor’s payroll expense. Misstatements in payroll may be the
result of errors or fraud, and may be overstatements or understatements. A
few examples of types of possible misstatements include:

• Staff being paid for hours they did not work.


• Incorrect rates being used to calculate paychecks and deductions.
• Incorrect application of rates to hours worked.
• Unauthorized changes in pay rates.
• Duplicate payments to the same individual.
• Checks being issued to individuals when they are no longer employees.
• Checks being paid to fictitious employees.
• Amounts deducted for taxes etc. that are not remitted to the appro-
priate agencies.
1 Login to http://www.hubae.org/ with your HUB account to access Bibitor’s Payroll
Process Manual.


• Amounts recorded in the GL that do not reconcile with amounts in the database files and tables for payroll.
Assertions addressed in auditing payroll expenses include occurrence,
completeness, accuracy and cut-off.
This chapter discusses using basic ADA in assessing risks of material mis-
statements in payroll expense. These ADA may also provide audit evidence
to corroborate or contradict audit evidence regarding payroll expenses ob-
tained from other sources. ADA might also be used as substantive procedures
to detect payroll misstatements. This might often be more complex.
For example, when ADA are used as substantive procedures, the auditor
would be required to establish the reliability of the data used in the ADA.
This could entail testing the operating effectiveness of relevant controls over
personnel records and the payroll authorization and payments process. The
auditor might also perform tests of details of relevant data. These types of
matters are beyond the scope of the material discussed in this chapter.

5.3 Data Understanding


Based on our prior experience with the client’s data, we are familiar with
the overall structure of the company’s relational database. To achieve the
objectives of our ADA used in auditing Bibitor’s payroll expense, we will
need data from the three tables in the company’s database as shown in
figure 5.1:
1. tStores. The primary key in this table is Store. That is, each store
has a unique identifying number.
2. tEmployee. The primary key in this table is EmpID. That is, each
employee has a unique identifying number. Store is a foreign key in
this table. This enables us to identify which employees work in each of
Bibitor’s 79 stores, or head office.
3. tPayroll. The primary key in this table is payroll_ID. That is, each
payroll transaction has a unique identifying number. EmpID is a for-
eign key in this table. This enables us to link each payroll transaction
to a particular employee.
The nature of the data contained in these files is shown in figure 5.1.
Much of this data will be relevant to the performance of our payroll ADA.
To streamline the process and reduce inefficiencies, we will start by loading
all three files and then combining them into one file which we will use for
audit purposes. This way, all subsequent analysis can be done by accessing
a single file.


Figure 5.1: Bibitor: Payroll Related Database Segment

5.3.1 Load and Review Payroll Data


Following directions shown in section 1.3, we create a directory to store our
data and analysis, download and save the files in this directory, and create a
new R file and name it BBTR_payroll. Using the following script, we import
the data in the table tPayroll.csv into a new data set named dtPayroll and
review it.2

> library(data.table)
> dtPayroll <- fread("tPayroll.csv")
> names(dtPayroll)

[1] "V1" "PayPeriod" "BegDate"


[4] "EndDate" "empID" "RatePerHour"
[7] "Salary" "SalaryWagesPerPP" "Hours"
[10] "FedTax" "StTax" "FICA"
[13] "EmFica" "Medicare" "EmMedicare"

> names(dtPayroll)[1] <- "payroll_ID"


> str(dtPayroll)

Classes 'data.table' and 'data.frame': 13557 obs. of 15 variables:
2 The complete R Script used for the creation of this chapter is available from the
following URL: https://goo.gl/qiPCJU


$ payroll_ID : chr "1" "2" "3" "4" ...


$ PayPeriod : int 1 1 1 1 1 1 1 1 1 1 ...
$ BegDate : chr "2015-07-01" "2015-07-01" ...
$ EndDate : chr "2015-07-15" "2015-07-15" ...
$ empID : chr "04-33895" "04-43490" ...
$ RatePerHour : num NA NA NA NA NA NA NA NA NA NA ...
$ Salary : num 10456 14982 65641 41590 48064 ...
$ SalaryWagesPerPP: num 436 624 2735 1733 2003 ...
$ Hours : int NA NA NA NA NA NA NA NA NA NA ...
$ FedTax : num 43.6 62.4 558 237.4 408.6 ...
$ StTax : num 22.7 32.5 142.2 90.1 104.1 ...
$ FICA : num 27 38.7 169.6 107.4 124.2 ...
$ EmFica : num 27 38.7 169.6 107.4 124.2 ...
$ Medicare : num 6.32 9.05 39.66 25.13 29.04 ...
$ EmMedicare : num 6.32 9.05 39.66 25.13 29.04 ...

Each observation captures the payroll-related information for each employee and each pay period.3

Format Dates
The review of payroll data has shown that dates are formatted as characters
(chr). This is going to be problematic if we want to use these dates for the
creation of graphs and data aggregation. Following directions from section
2.4, we convert these two variables to as.Date format and review the new
structure as follows:

> dtPayroll$BegDate <- as.Date(dtPayroll$BegDate, "%Y-%m-%d")


> dtPayroll$EndDate <- as.Date(dtPayroll$EndDate, "%Y-%m-%d")
> str(dtPayroll)

...
$ BegDate : Date, format: "2015-07-01" ...
$ EndDate : Date, format: "2015-07-15" ...
...

Verify that beginning payroll dates are within the fiscal year
We can verify this by ordering our data in ascending and descending order of
payroll beginning dates (BegDate) and viewing the top three observations,
as follows:
3 FICA stands for Federal Insurance Contributions Act.


> head(dtPayroll[order(dtPayroll$BegDate),3:9],3)

BegDate EndDate empID RatePerHour Salary


1: 2015-07-01 2015-07-15 04-33895 NA 10456.39
2: 2015-07-01 2015-07-15 04-43490 NA 14982.28
3: 2015-07-01 2015-07-15 04-61555 NA 65640.72
SalaryWagesPerPP Hours
1: 435.6829 NA
2: 624.2617 NA
3: 2735.0300 NA

> head(dtPayroll[order(-dtPayroll$BegDate),3:9],3)

BegDate EndDate empID RatePerHour Salary


1: 2016-06-16 2016-06-30 04-33895 NA 10456.39
2: 2016-06-16 2016-06-30 04-43490 NA 14982.28
3: 2016-06-16 2016-06-30 04-61555 NA 65640.72
SalaryWagesPerPP Hours
1: 435.6829 NA
2: 624.2617 NA
3: 2735.0300 NA

The 3 rows in the first table above show data for the earliest payroll
period included in the dataset. As we would expect, the first pay period is
for the first two weeks in July 2015. The employee numbers for the first 3
employees paid in the period (by employee ID number) are noted. The Salary
column shows the total salary for the respective employees for the year. The
amount paid (again as we would expect) is the annual salary divided by 24
(the number of pay periods during the year).
Since these are salaried employees, the rate per hour, and hours worked
columns are blank. The 3 rows in the second table above show similar data
for the last payroll period included in the dataset. The observations in the
two tables show that the first payroll period starts on July 1, 2015 and the
last payroll period ends on June 30, 2016. Therefore, the data being used for
the ADA all fall within Bibitor’s 2016 fiscal year.
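
An equivalent one-line check (a supplementary alternative to ordering the
data) is to ask R directly for the earliest and latest dates:

> # Earliest and latest pay period start and end dates
> range(dtPayroll$BegDate); range(dtPayroll$EndDate)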

Are there any employees with more than one paycheck in one pay
period?
Early in performing the ADA, it may be useful to determine if there is any em-
ployee who has obtained more than one paycheck in a bi-weekly pay period.


The reason is that the existence of multiple payments (whether authorized or not) could affect other aspects of the analysis.
Absent rare circumstances, there should be one paycheck per employee
per pay period. Therefore, if we count the number of paychecks for each
employee/period we should not find any employee who has more than one.
We can test for this by creating a query that aggregates (counts employee
IDs per pay period) and imposes the constraint that count of employee ID is
greater than one, as follows:4

> library(sqldf)
> sqldf("SELECT PayPeriod, empID, count(DISTINCT empID)
FROM dtPayroll GROUP BY PayPeriod, empID
HAVING count(empID)>1")

[1] PayPeriod empID


[3] count(DISTINCT empID)
<0 rows> (or 0-length row.names)

The query generates zero rows, which means that in our data set there are no
employees who received more than one paycheck in one period.
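
The same test can be cross-checked in base R (a supplementary sketch that
relies only on the two key columns) by asking whether any employee/pay-period
combination appears more than once:

> # TRUE would indicate at least one duplicated employee/pay-period combination
> any(duplicated(dtPayroll[, c("PayPeriod", "empID")]))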

Practice Problems: Verify Data


Use the same approach to verify the following topics:
1. Verify that ending payroll dates are within the fiscal year.5
2. Review range of salaries.6
3. Review range of hourly rates.7
4. Review range of hours.8

5.3.2 Load and Review Data from Employee Table


The ADA will involve the use of data from the table tEmployees. Using the
following script, we import and review the data from that table (tEmploy-
ees.csv).

> dtEmployees <- fread("tEmployees.csv")


> names(dtEmployees)
4 See section 2.3 on how to create aggregate queries.
5 The solution to this practice problem is on p. 102.
6 The solution to this practice problem is on p. 102.
7 The solution to this practice problem is on p. 103.
8 The solution to this practice problem is on p. 103.


[1] "emplID" "FirstName" "LastName" "Store" "Title"

> str(dtEmployees)

Classes 'data.table' and 'data.frame': 1120 obs. of 5 variables:
$ emplID : chr "04-33895" "04-43490" "04-61555" ...
$ FirstName: chr "Blanca" "Alexia" "Nicolas" "Brian" ...
$ LastName : chr "Michl" "Odhrani" "Smith" "Kumar" ...
$ Store : int 0 0 0 0 0 0 0 0 0 0 ...
$ Title : chr "CLERK" "BUILDING SERVICE WORKER" ...

Based on the above, Bibitor has 1120 employees and for each employee
we have the following information: unique identification number, first and
last name, their store, and title. Please note that the firm uses store zero (0)
to indicate the firm’s headquarters.
Similar to the approach taken above regarding payroll dates, we can de-
termine that our data includes employees for all locations (0 to 79) by ordering the
data. The resulting tables are set out below.

> head(dtEmployees[order(dtEmployees$Store),],3)

emplID FirstName LastName Store Title


1: 04-33895 Blanca Michl 0 CLERK
2: 04-43490 Alexia Odhrani 0 BUILDING SERVICE WORKER
3: 04-61555 Nicolas Smith 0 REGIONAL STORE SUPERVISOR

> head(dtEmployees[order(-dtEmployees$Store),],3)

emplID FirstName LastName Store Title


1: 00-79898 Meghna Lopiccolo 79 Retail Store Manager III
2: 01-79898 Olivia Machado 79 Retail Store Clerk III
3: 01-25735 Victoria Attari 79 Retail Store Clerk III

Headcount: HQ vs. stores


To generate the headcount at headquarters and stores, we use an ifelse
statement to distinguish employees based on their location as follows:
> dtEmployees$location <-
ifelse(dtEmployees$Store>0, "stores", "HQ")
Leveraging the new variable (location) we create an aggregate query that
provides the head count by location, as follows:


> sqldf("SELECT location, count(DISTINCT emplID) AS emplCount


FROM dtEmployees GROUP BY location")

location emplCount
1 HQ 61
2 stores 1059

Practice Problem: List of job titles


1. Create a table that shows all different job titles in Bibitor and the
headcount in each job. Hint: Use the function table() in R.9
2. Create a table that shows the percentage of employees in each job.

5.3.3 Load and Review Store Data


Using the following script, we import and review the data in the table
tStores.csv.
> dtStores <- fread("tStores.csv")
> names(dtStores)

[1] "Store" "City" "Location" "SqFt"

> nrow(dtStores)

[1] 79

5.3.4 Create Combined Data for ADA


To perform the ADA we need all observations from the payroll file, plus
matching information that will let us add each employee’s location and store
size. While it is possible to combine all three files into one file using a LEFT
JOIN in a single step, we will break this into two steps.10
First, we will add store number and employee title from the employee
table into the payroll file. We name the combined data file dt1.

> dt1 <- sqldf("SELECT a.*, Store, Title


FROM dtPayroll AS a LEFT JOIN dtEmployees AS b
ON a.empID=b.emplID")
> names(dt1)
9 The solution to this practice problem is on p. 104.
10 The objective in a left join is to keep all observations from the table listed on the left
and all matching observations from the table listed on the right. For a review of LEFT
JOIN see section 2.6.3.


[1] "payroll_ID" "PayPeriod" "BegDate"


[4] "EndDate" "empID" "RatePerHour"
[7] "Salary" "SalaryWagesPerPP" "Hours"
[10] "FedTax" "StTax" "FICA"
[13] "EmFica" "Medicare" "EmMedicare"
[16] "Store" "Title"

> nrow(dt1)

[1] 13557

Second, we will add the store size (SqFt) from the store file into the combined
file (dt1 ) created in the previous step.

> dt1 <- sqldf("SELECT a.*, SqFt


FROM dt1 AS a LEFT JOIN dtStores AS b
ON a.Store=b.Store")
> names(dt1)

[1] "payroll_ID" "PayPeriod" "BegDate"


[4] "EndDate" "empID" "RatePerHour"
[7] "Salary" "SalaryWagesPerPP" "Hours"
[10] "FedTax" "StTax" "FICA"
[13] "EmFica" "Medicare" "EmMedicare"
[16] "Store" "Title" "SqFt"

> nrow(dt1)

[1] 13557
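
Since both joins preserve the 13,557 payroll records, a further sanity check
(our addition, not part of the original script) is to confirm that every payroll
record found a matching employee and, for store-based employees, a matching
store record:

> # Payroll records with no matching employee (Store would be NA)
> sum(is.na(dt1$Store))
> # Store-based records with no matching square footage (head office, Store 0, is expected to have no SqFt)
> sum(is.na(dt1$SqFt) & dt1$Store>0)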

5.4 ADA Modeling


As part of obtaining an understanding of Bibitor’s business, the auditor
would make inquiries of Bibitor’s management regarding whether there have
been any significant changes in matters affecting payroll in the fiscal year
being audited. Management’s response might be, for example, that there
have been no significant changes. The auditor can use ADA to corroborate
or refute management’s response as part of the assessment of risk of material
misstatement of payroll expenses. The ADA will therefore entail obtain-
ing and analyzing information on matters such as number and locations of
employees, salary levels, hours worked and wage rates paid.


5.4.1 Salaried and Hourly Employees


Objectives: To obtain an overall profile of the key attributes of salaried
employees and hourly employees. The auditor’s statistics will be compared
with the auditor’s expectations based on the results of the same analysis
performed in the previous year. For example:

• For salaried employees, based on the auditor’s understanding of the


company’s business, the auditor may anticipate only small changes in
the total number of employees in each group. If there are significant
increases or decreases in the number of employees in each group, this re-
sult would warrant follow up procedures, likely including more detailed
analyses.
• For salaried employees, the auditor will anticipate that there will be a
wide range of salaries between the maximum (e.g., for the CEO) and the
minimum (e.g., for a part-time mail clerk). However, if the maximum,
minimum, or other statistics are significantly different from what the
auditor anticipates, this result would warrant follow up procedures,
likely including more detailed analyses.

Desired output would be summary statistics showing min, max, median and quartiles for salaried employees and, separately, for hourly employees the
total hours they worked for the year and the total amounts they received in
pay for the year. Normally this would be gross pay. This would, of course,
involve aggregating data. (Note - this shows the power of ADA. This would
not be practicable to do manually).
To proceed with our analysis, we create a new variable (status) that distin-
guishes between hourly (hourlyEmpl) and salaried employees (salariedEmpl).
To make this distinction, we leverage the fact that in the payroll file, the
salary field is missing (NA) for hourly employees.

> dt1$status <-


ifelse(is.na(dt1$Salary), "hourlyEmpl", "salariedEmpl")
> summary(dt1[dt1$status=="salariedEmpl",c(7,8)])

Salary SalaryWagesPerPP
Min. : 10346 Min. : 431.1
1st Qu.: 39780 1st Qu.:1657.5
Median : 45950 Median :1914.6
Mean : 47687 Mean :1986.9
3rd Qu.: 52290 3rd Qu.:2178.8
Max. :107823 Max. :4492.6


> summary(dt1[dt1$status=="hourlyEmpl",c(6,8,9)])

RatePerHour SalaryWagesPerPP Hours


Min. :11.67 Min. : 70.02 Min. : 6.00
1st Qu.:12.58 1st Qu.: 928.83 1st Qu.:68.00
Median :13.93 Median :1061.06 Median :75.00
Mean :14.24 Mean :1034.23 Mean :71.87
3rd Qu.:15.87 3rd Qu.:1201.63 3rd Qu.:82.00
Max. :17.98 Max. :1582.24 Max. :88.00

The results shown in the above tables provide information that may be
useful in identifying outliers. For example, based on the results of audits of
previous years, and updated information obtained by inquires of management
in the current year, the auditor would have expectations regarding what the
summary statistics should look like. If the statistics obtained were to differ
significantly from what the auditor expected, the auditor would perform
further procedures to determine why there are significant variances from
expectations.
This output also provides information on the accuracy of the calculation
of payroll. For example, for salaried employees, the total salary in each row in
the first table divided by the number of pay periods (24) equals the amount
paid in each pay period. For wage employees, in row 1 of the second table,
the minimum hourly rate ($11.67) times the minimum hours (6.0) equals the
minimum amount paid in a pay period ($70.02). Similarly, in the last row,
the maximum hourly rate ($17.98) times the maximum hours (88.0) equals
the maximum amount paid in a pay period ($1,582.24).
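
These relationships can be recomputed across every record rather than just the
extremes (a supplementary sketch we have added); if payroll has been calculated
consistently, the differences below should be at or very close to zero:

> # Salaried employees: per-period pay should equal annual salary divided by 24
> salaried <- dt1[dt1$status=="salariedEmpl",]
> max(abs(salaried$SalaryWagesPerPP - salaried$Salary/24))
> # Hourly employees: per-period pay should equal rate per hour times hours worked
> hourly <- dt1[dt1$status=="hourlyEmpl",]
> max(abs(hourly$SalaryWagesPerPP - hourly$RatePerHour*hourly$Hours))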

Focus on Hourly Employees


In our payroll analysis of hourly employees, we need to examine total hours,
hourly rates, and total wages paid. We create a new data set that captures
this information by limiting our observations to employees that have a status
of hourly employee (hourlyEmpl) and by grouping data at the employee level
as follows:

> dt1_a <- sqldf("SELECT empID, sum(Hours) AS totalHours,


avg(RatePerHour) AS avgRateHour,
sum(SalaryWagesPerPP) AS totalWages
FROM dt1 WHERE status = 'hourlyEmpl'
GROUP BY empID")
> str(dt1_a)


'data.frame': 963 obs. of 4 variables:


$ empID : chr "01-11175" "01-11204" "01-12098" ...
$ totalHours : int 1811 1821 1844 1819 1810 1868 ...
$ avgRateHour: num 11.7 15.6 17.8 12.3 14.2 ...
$ totalWages : num 21261 28444 32842 22319 25774 ...

The new data set has 963 observations. An analysis of staff by position
would show that these represent the 884 retail store clerks and 79 receivers
who are paid hourly (see Practice Problem 5.3.2). There are 96 retail store
managers who are salaried employees. Therefore, the total number of staff
located in stores is 1059 (963 + 96), consistent with our earlier analysis.

> summary(dt1_a[,2:4])

totalHours avgRateHour totalWages


Min. : 9.0 Min. :11.67 Min. : 105.0
1st Qu.: 34.0 1st Qu.:11.67 1st Qu.: 396.8
Median : 74.0 Median :11.67 Median : 863.6
Mean : 731.3 Mean :12.76 Mean :10523.8
3rd Qu.:1810.0 3rd Qu.:13.46 3rd Qu.:24419.3
Max. :1917.0 Max. :17.98 Max. :33613.6

The results show that the distribution of hours worked is widely spread
and reflects two groups. Employees in the top quartile seem to work full time
(around 40 hours/week times 50 weeks produces around 2,000 hours per
year). The remaining hourly employees seem to work one or two weeks per
year. This most likely occurs during the weeks of peak sales (e.g., Christmas).
This same pattern is reflected in total wages paid.
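
One way to make the two groups explicit (an illustrative split using a 1,000-hour
cut-off that we have chosen for this sketch, not one taken from the chapter) is:

> # Rough split of hourly employees into full-time and seasonal workers
> table(ifelse(dt1_a$totalHours>=1000, "fullTime", "seasonal"))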

Calculation of Total of Hours, Wages, and Salaries Paid


With the following query, we generate the total employee count, as well as
total hours, average hourly wage, and total wages paid for hourly employees.
Notice that this is an aggregate query (same as the one above) that does not
use a GROUP BY statement.

> sqldf("SELECT count(DISTINCT empID) AS emplCount,


sum(Hours) AS totalHours,
avg(RatePerHour) AS avgRateHour,
sum(SalaryWagesPerPP) AS totalWages
FROM dt1 WHERE status = 'hourlyEmpl'")


emplCount totalHours avgRateHour totalWages


1 963 704283 14.24055 10134431

Using a similar approach, we generate the employee count and total salaries


paid to non-hourly employees as follows:

> sqldf("SELECT count(DISTINCT empID) AS emplCount,


sum(SalaryWagesPerPP) AS totalSalaries
FROM dt1 WHERE status = 'salariedEmpl'")

emplCount totalSalaries
1 157 7466950

The total of wages and salaries paid per our analysis is $17,601,381
($10,134,431 + $7,466,950). If we were to perform similar calculations for the
various payroll deductions, we might find that they total $6,164,722. In that
case, our total calculated payroll expense would be $23,766,103 which would
agree with the amount shown as “personnel services expense” in Bibitor’s
2016 income statement. If the amounts did not agree, the auditor would
determine why that was the case. However, this ADA would provide only
some limited evidence of the accuracy of the process to sum payroll amounts
and record them in the GL. Evidence from other procedures, including ADA,
would be needed to address the various assertions related to payroll. For ex-
ample, as noted earlier, procedures would need to be performed to verify the
reliability of the data used in the payroll process.
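
As a starting point for this reconciliation (a sketch only; which columns make up
the GL “personnel services” figure is an assumption to be confirmed, and the full
reconciliation is left to the practice problem below), the deduction-related columns
in the combined data can be totalled as follows:

> # Column totals for the deduction-related fields in the combined payroll data
> deductionCols <- c("FedTax", "StTax", "FICA", "EmFica", "Medicare", "EmMedicare")
> colSums(dt1[, deductionCols], na.rm=TRUE)
> sum(colSums(dt1[, deductionCols], na.rm=TRUE))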

Practice Problem: Deductions


1. Recalculate deductions (as well as wages paid) and reconcile results to
total payroll expense shown on the income statements.

5.4.2 Bi-Weekly Wages


Objectives: Auditors commonly review data at a disaggregated level (e.g.,
not just totals for a year). This helps them assess, for example, whether
there are relatively small misstatements that could aggregate to a material
misstatement. Bibitor’s staff are paid on a bi-weekly basis. Therefore, it
would be useful to (1) determine the number of employees paid, and total
amounts paid to them, in each bi-weekly period (separately for hourly and
salaried employees), and (2) plot the results on a graph.
The auditor would likely anticipate, for example, that the bi-weekly
hourly pay totals for late November and December would be higher than for
the rest of the year because of longer store hours to accommodate increasing
demand for products in the holiday season. If this or another anticipated
fluctuation is not reflected in the results, the auditor would perform further
procedures to obtain information on why this occurred.
We create a new data set that aggregates (GROUP BY) payroll information,
such as total payments and employee count, at the period and status level,
as follows:

> dt1_b <- sqldf("SELECT BegDate, status,


sum(SalaryWagesPerPP) AS totalPmts,
count(DISTINCT empID) AS emplCount
FROM dt1 GROUP BY BegDate, status")

We generate summary statistics for each group (hourly versus salaried employees) by subsetting based on the variable status.

> summary(dt1_b[dt1_b$status=="hourlyEmpl",3:4])

totalPmts emplCount
Min. :404182 Min. :368.0
1st Qu.:410950 1st Qu.:372.0
Median :412980 Median :380.0
Mean :422268 Mean :408.3
3rd Qu.:420146 3rd Qu.:404.2
Max. :535426 Max. :713.0

> summary(dt1_b[dt1_b$status=="salariedEmpl",3:4])

totalPmts emplCount
Min. :310087 Min. :156.0
1st Qu.:310087 1st Qu.:156.0
Median :311863 Median :157.0
Mean :311123 Mean :156.6
3rd Qu.:311863 3rd Qu.:157.0
Max. :311863 Max. :157.0

As expected, payments and employee count of salaried employees are practically unchanged across pay periods. However, there is a range of values
for hourly employees. The former reflects the fixed nature of salary cost while
the latter reflects the variable nature of wage cost.
When dealing with time series data, the creation of a time series graph is
the best way to visualize the underlying trend. Given the fact that there is
no variance in salaried employee count and payments, we will create graphs


only for hourly employees. Following the approach shown in section 3.4.2,
we leverage the function ggplot to create the time series graph of employee
count by pay period.11

> library(ggplot2)
> g4_emplCount <-
ggplot(dt1_b[dt1_b$status=="hourlyEmpl",], aes(x=BegDate)) +
geom_point(aes(y=emplCount))+
geom_line(aes(y=emplCount))+
xlab("Beginning Date")+ylab("Employee Count")+
scale_x_date(date_breaks = "1 month")
> g4_emplCount

The graph (Figure 5.2) shows that the number of hourly employees spikes
in July and December. This pattern is consistent with the prior analysis of
sales data (see Figure 3.3).

Figure 5.2: Bi-Weekly Hourly Employee Count

Following a similar approach, we examine the time series pattern of total
payments (wages) to hourly workers. The commands are shown below and
the resulting graph is captured in Figure 5.3. The graph shows a pattern
similar to the one shown for employee count (Fig. 5.2) and sales (Fig. 3.3).
That is, there are spikes in July and December.
11 Notice that in order to make the horizontal axis more readable, we have used the ad-
ditional command scale_x_date ( date_breaks = "1 month" ). This command spec-
ifies that the input is formatted as a date and we would like to have a break per month.
According to R help (type ?date_break in the console), we can specify "sec", "min", "hour",
"day", "week", "month", or "year". We can specify multiple periods, e.g., two months as "2
months".


> g4_totalPmts <-


ggplot(dt1_b[dt1_b$status=="hourlyEmpl",], aes(x=BegDate)) +
geom_point(aes(y=totalPmts))+
geom_line(aes(y=totalPmts))+
xlab("Beginning Date")+ylab("Total Payments")+
scale_x_date(date_breaks = "1 month")
> g4_totalPmts

Figure 5.3: Bi-Weekly Payment Total Wages

5.4.3 Bi-Weekly Wages: Store Level


Objectives: Disaggregating the data to the store level is useful because
under Bibitor’s policies and procedures, hiring, firing etc. is done by man-
agers at the store level (with some exceptions). Again, disaggregating data
in this way helps to identify the possibility of smaller misstatements that
could aggregate to material misstatements.
An unusually high or low amount of payroll expense by store might in-
dicate a need for further work by the auditor. What is “unusual” would be
determined by the auditor based on the auditor’s knowledge of the company.
We create a new data set (dt1_c) based on an aggregate query that
provides store-level payroll data for each pay period.

> dt1_c <- sqldf("SELECT Store, BegDate, status, SqFt,


sum(SalaryWagesPerPP) AS totalPmts,
sum(Hours) AS totalHours,
count(DISTINCT empID) AS emplCount
FROM dt1 WHERE status = 'hourlyEmpl'
GROUP BY Store, BegDate")
> head(dt1_c)


Store BegDate status SqFt totalPmts totalHours


1 1 2015-07-01 hourlyEmpl 8300 6430.67 442
2 1 2015-07-16 hourlyEmpl 8300 7714.34 556
3 1 2015-08-01 hourlyEmpl 8300 6414.49 447
4 1 2015-08-16 hourlyEmpl 8300 6980.60 483
5 1 2015-09-01 hourlyEmpl 8300 6371.24 441
6 1 2015-09-16 hourlyEmpl 8300 6590.71 464
emplCount
1 6
2 9
3 6
4 6
5 6
6 6

Visualize Bi-Weekly Employee Count per Store


Given the large number of stores, it would be tedious to generate and review
79 graphs (one per store). A way to simplify the process is to view multiple
stores in the same graph. With the following script, we will generate the
graph showing the employee count for stores 1 through 20:
First, we specify that the data set is limited to stores having a store ID less than
or equal to 20 (dt1_c[dt1_c$Store<=20,]).12 Second, we use the function
facet_wrap to create a separate graph for each store (group data by store).
Notice that the function takes two arguments: the variable that specifies the
grouping (∼Store) and the number of rows (nrow = 5). Given that there
are 20 stores (graphs) and five rows, the resulting graph will have four graphs
per row.

> g4_emplCountStore <-


ggplot(dt1_c[dt1_c$Store<=20,], aes(x=BegDate)) +
geom_point(aes(y=emplCount)) +
geom_line(aes(y=emplCount)) +
xlab("Beginning Date") + ylab("Employee Count")
> g4_emplCountStore + facet_wrap(~Store, nrow = 5)

The resulting graphs can be accessed from the following link: https://goo.gl/yw1UjG.
The auditor would look at the resulting graphs (and similar
graphs generated for the other stores) to identify those stores, if any, having
unusual employee counts. For example, these might be employee counts that
12 See section 1.4.1 to review the instructions on how to create subsets.


are significantly higher or lower than those of other stores. Also, some
stores might have employee counts that vary significantly from what might
be considered a normal pattern during the pay period cycle. For example,
looking at the graph generated for stores 1-20, we might expect that virtually
all stores would hire more staff (i.e., increase the employee count) in the peak
of the summer season and near the end of the calendar year (that for many
people is the “Holiday Season”). It would be reasonable to expect increased
consumption of alcoholic beverages in those periods compared to other times
during the year, with a resulting need to hire more hourly staff to help
maintain a high quality of customer service.
Looking at the graphs generated for each store, we see, for example,
that virtually all the stores hire more staff at the start of the year and for
the holiday season. However, there are exceptions. For example, store 3
only increases its staff complement in autumn and store 12 does not hire
more staff for the holiday season. There are legitimate reasons why such
exceptions may occur. For example, the community served by these two
stores may consist mainly of people who do not participate in the “holiday
season.” On the other hand, there may, for example, be a risk that these stores
do, in fact, increase their staff complement in the holiday season but this is
done by means of making cash payments outside the normal payroll system.
This may indicate matters that are qualitatively material (e.g., violations of
employment, tax and other laws). The auditor may therefore make inquiries
about the reasons for these (and other unusual trends identified) and obtain
evidence to support, or contradict, management’s responses to the inquiries.

Practice Problems: Employee Count By Store


Generate the bi-weekly employee count per store for the following groups of
stores:13

1. Stores 21 through 40.


2. Stores 41 through 60.
3. Stores 61 through 79.

Visualize Bi-Weekly Total Wages per Store


It may also be useful to use a similar approach to develop graphics that show
bi-weekly total wages per store. The patterns obtained may be similar to
those obtained from the analysis of bi-weekly hours worked. However, vari-
ations are possible because, for example, wage rates may vary among stores
13 The solution to this practice problem is on p. 105.


because staff have different pay grades (e.g., some stores may, on average,
have more senior employees than other stores).
With the following script we generate the graphs for stores 1 through
20. The resulting graphs can be accessed from the following link: https://goo.gl/o7bB2F.
> g4_totalPmtsStore <-
ggplot(dt1_c[dt1_c$Store<=20,], aes(x=BegDate)) +
geom_point(aes(y=totalPmts))+
geom_line(aes(y=totalPmts))+
xlab("Beginning Date")+ylab("Total Payments")
> g4_totalPmtsStore+facet_wrap(~Store, nrow = 5)
As might be expected, the patterns for stores 3 and 12 are similar to those
noted in the analysis of bi-weekly hours per store. In addition, it becomes
apparent from looking at both bi-weekly hours graphics and wages graphics
that the number of hourly staff, and consequently wages paid, differs signifi-
cantly among stores. For example, the wages paid, and number of employees,
for store 15 is significantly greater than for store 18. This may relate to store
size (as we explore in another ADA below). But such variations may be in-
dicative of other issues. For example, employees at a store may be being paid
at rates above those authorized. So again, the auditor may make inquiries
about the reasons for these (and other unusual trends identified) and obtain
evidence to support, or contradict, management’s responses to the inquiries.

Practice Problems: Total Wages By Store


Generate the bi-weekly total wages per store for the following groups of
stores:14
1. Stores 21 through 40.
2. Stores 41 through 60.
3. Stores 61 through 79.

5.4.4 Visualize Payroll by Store Size (SqFt)


Objectives: The auditor will anticipate that the payroll expense for bigger
stores will likely be larger (although the relationship will not be linear). For
example, the larger stores are likely to have more cashiers and people stocking
shelves, but that can vary (e.g., larger stores might have more automated
check outs). Although the statistic would be open to interpretation, it would
14 The solution to this practice problem is on p. 106.


be useful to account, at least in part, for the effect that store size has on
payroll. For example, the payroll expense per square foot for one or more
stores may be significantly different from what the auditor anticipates, based
on knowledge of Bibitor’s business. This might warrant more audit work on
payroll expense for these stores.
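
One such statistic can be computed directly from the store-level data set dt1_c
created above (a supplementary sketch we have added; it covers annual hourly
wages only and relies on SqFt being constant within each store):

> # Annual hourly wages per square foot for each store
> wagesPerSqFt <- sqldf("SELECT Store, SqFt,
  sum(totalPmts) AS annualWages,
  sum(totalPmts)/SqFt AS wagesPerSqFt
  FROM dt1_c GROUP BY Store")
> summary(wagesPerSqFt$wagesPerSqFt)
> head(wagesPerSqFt[order(-wagesPerSqFt$wagesPerSqFt),])
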
Leveraging prior knowledge from section 3.3.2, we classify stores in terms
of store size (SqFt). We create a new data set (dt1_d) that has all information
from dt1_c (i.e., store-level payroll data for each pay period), as well as the
variable that captures store size.

> dt1_d <- sqldf("SELECT dt1_c.*,


CASE
WHEN SqFt> 19500 THEN 'sqftOutlier'
WHEN SqFt> 10200 THEN 'Q4'
WHEN SqFt> 6400 THEN 'Q3'
WHEN SqFt> 4000 THEN 'Q2'
ELSE 'Q1'
END AS storeSize
FROM dt1_c")
> head(dt1_d)

Store BegDate status SqFt totalPmts totalHours


1 1 2015-07-01 hourlyEmpl 8300 6430.67 442
2 1 2015-07-16 hourlyEmpl 8300 7714.34 556
3 1 2015-08-01 hourlyEmpl 8300 6414.49 447
4 1 2015-08-16 hourlyEmpl 8300 6980.60 483
5 1 2015-09-01 hourlyEmpl 8300 6371.24 441
6 1 2015-09-16 hourlyEmpl 8300 6590.71 464
emplCount storeSize
1 6 Q3
2 9 Q3
3 6 Q3
4 6 Q3
5 6 Q3
6 6 Q3
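The SqFt cut-offs in the CASE expression appear to be the store-size quartiles
(and the upper outlier threshold) derived in section 3.3.2. As a quick sanity
check, they can be recomputed from the data at hand; the sketch below takes
one SqFt value per store and asks for its quartiles, and the results should be
close to the 4,000, 6,400, and 10,200 cut-offs used above.

> # one row per store, then the quartiles of store size
> storeSqFt <- unique(data.frame(Store = dt1_c$Store, SqFt = dt1_c$SqFt))
> quantile(storeSqFt$SqFt)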

Visualize Employee Count by Quartiles of Store Size


Using the function facet_wrap we create graphs for employee count of each
store grouped in terms of store size. Our first set of graphs captures the
group of smallest stores (Q1 ). The resulting graphs can be accessed from
the following link: https://goo.gl/xr1B54.


> g4_emplCountStoreSize <-


ggplot(dt1_d[dt1_d$storeSize=="Q1",], aes(x=BegDate)) +
geom_point(aes(y=emplCount))+
geom_line(aes(y=emplCount))+
xlab("Beginning Date")+ylab("Employee Count")
> g4_emplCountStoreSize+facet_wrap(~Store, nrow = 5)

Practice Problems: Employee Count by Quartiles of Store Size


Generate the bi-weekly employee count per store for the following groups of
stores:15
1. Stores in the second quartile (Q2).
2. Stores in the third quartile (Q3).
3. Stores in the fourth quartile (Q4), but not outliers.
4. Stores which are outliers in terms of size (sqftOutlier).

15 The solution to this practice problem is on p. 107.
We noted above that the wages paid, and number of employees, for store
15 are significantly greater than for store 18. The graphics by store size show
that store 15 is a large store (Q4, greater than 10,200 sq. ft.) and that store
18 is a small store (Q1, less than 4,000 sq. ft.). Therefore, the fact that store
15 has more staff and higher wage costs than store 18 makes sense.

Visualize Wage Payments by Quartiles of Store Size


Generate bi-weekly total wage payments per store for stores that belong to
the first quartile (Q1) in terms of size.
> g4_totalPmtsStoreSize <-
ggplot(dt1_d[dt1_d$storeSize=="Q1",], aes(x=BegDate)) +
geom_point(aes(y=totalPmts))+
geom_line(aes(y=totalPmts))+
xlab("Beginning Date")+ylab("Total Payments")
> g4_totalPmtsStoreSize+facet_wrap(~Store, nrow = 5)
The resulting graphs can be accessed from the following link:
https://goo.gl/Ht9ab8.

Practice Problems: Wage Payments by Quartiles of Store Size


Generate the bi-weekly total wage payments per store for the following groups
of stores:16
1. Stores in the second quartile (Q2).
2. Stores in the third quartile (Q3).
3. Stores in the fourth quartile (Q4), but not outliers.
4. Stores which are outliers in terms of size (sqftOutlier).

16 The solution to this practice problem is on p. 108.

5.5 Evaluation and Communication


In this case, it is assumed that the results of the ADA performed to as-
sess risks of material misstatement of payroll expenses did not result in the
identification of any matters that indicated an increased risk of material mis-
statement to which the auditor would need to respond. This result would be
communicated among appropriate members of the engagement team. There
would be no matters that require communication to management or the audit
committee.

5.6 Documentation
An example audit working paper showing documentation for this ADA is lo-
cated at https://docs.google.com/document/d/1g0j6rNluvhvulwXczbs9SVfeUdk7D-4Omqe1Ul6P9
edit?usp=sharing.


5.7 Solutions to Selected Practice Problems


5.7.1 Solution to Practice Problems - section 5.3.1
Solution to practice problem - verify ending payroll dates

> head(dtPayroll[order(dtPayroll$EndDate),3:9],3)

BegDate EndDate empID RatePerHour Salary


1: 2015-07-01 2015-07-15 04-33895 NA 10456.39
2: 2015-07-01 2015-07-15 04-43490 NA 14982.28
3: 2015-07-01 2015-07-15 04-61555 NA 65640.72
SalaryWagesPerPP Hours
1: 435.6829 NA
2: 624.2617 NA
3: 2735.0300 NA

> head(dtPayroll[order(-dtPayroll$EndDate),3:9],3)

BegDate EndDate empID RatePerHour Salary


1: 2016-06-16 2016-06-30 04-33895 NA 10456.39
2: 2016-06-16 2016-06-30 04-43490 NA 14982.28
3: 2016-06-16 2016-06-30 04-61555 NA 65640.72
SalaryWagesPerPP Hours
1: 435.6829 NA
2: 624.2617 NA
3: 2735.0300 NA

Solution to practice problem - review range of salaries

> head(dtPayroll[order(dtPayroll$Salary),3:9],3)

BegDate EndDate empID RatePerHour Salary


1: 2015-07-01 2015-07-15 04-87464 NA 10346.5
2: 2015-07-16 2015-07-31 04-87464 NA 10346.5
3: 2015-08-01 2015-08-15 04-87464 NA 10346.5
SalaryWagesPerPP Hours
1: 431.1042 NA
2: 431.1042 NA
3: 431.1042 NA

> head(dtPayroll[order(-dtPayroll$Salary),3:9],3)


BegDate EndDate empID RatePerHour Salary


1: 2015-07-01 2015-07-15 04-59749 NA 107822.9
2: 2015-07-16 2015-07-31 04-59749 NA 107822.9
3: 2015-08-01 2015-08-15 04-59749 NA 107822.9
SalaryWagesPerPP Hours
1: 4492.619 NA
2: 4492.619 NA
3: 4492.619 NA

Solution to practice problem - review range of hourly rates


> head(dtPayroll[order(dtPayroll$RatePerHour),3:9],3)

BegDate EndDate empID RatePerHour Salary


1: 2015-07-01 2015-07-15 02-28535 11.67 NA
2: 2015-07-01 2015-07-15 02-79898 11.67 NA
3: 2015-07-01 2015-07-15 02-67532 11.67 NA
SalaryWagesPerPP Hours
1: 385.11 33
2: 443.46 38
3: 501.81 43

> head(dtPayroll[order(-dtPayroll$RatePerHour),3:9],3)

BegDate EndDate empID RatePerHour Salary


1: 2015-07-01 2015-07-15 01-53372 17.98 NA
2: 2015-07-16 2015-07-31 01-53372 17.98 NA
3: 2015-08-01 2015-08-15 01-53372 17.98 NA
SalaryWagesPerPP Hours
1: 1222.64 68
2: 1168.70 65
3: 1546.28 86

Solution to practice problem - review range of hours


> head(dtPayroll[order(dtPayroll$Hours),3:9],3)

BegDate EndDate empID RatePerHour Salary


1: 2015-07-16 2015-07-31 02-87680 11.67 NA
2: 2015-08-01 2015-08-15 02-60404 11.67 NA
3: 2015-12-16 2015-12-31 02-88771 11.67 NA
SalaryWagesPerPP Hours


1: 70.02 6
2: 81.69 7
3: 105.03 9

> head(dtPayroll[order(-dtPayroll$Hours),3:9],3)

BegDate EndDate empID RatePerHour Salary


1: 2015-07-01 2015-07-15 01-75069 16.48 NA
2: 2015-07-01 2015-07-15 01-93278 12.87 NA
3: 2015-07-01 2015-07-15 01-77721 15.81 NA
SalaryWagesPerPP Hours
1: 1450.24 88
2: 1132.56 88
3: 1391.28 88

5.7.2 Solution to Practice Problems - section 5.3.2


Solution to practice problem - List of Job Titles

We can use the function table to create a frequency count of employee titles
as follows:17
> table(dtEmployees$Title)

ACCOUNTANT ADMINISTRATIVE ASSISTANT


5 4
ADMINISTRATIVE SUPERVISOR ATTORNEY
1 2
BEVERAGE MARKETING SPECIALIST BUILDING SERVICE WORKER
3 8
BUILDING SERVICES SUPERVISOR BUYER
1 2
CHIEF EXECUTIVE OFFICER CHIEF MARKETING OFFICER
1 1
CHIEF OPERATING OFFICER CLERK
1 2
DIRECTOR OF HUMAN RESOURCES DIRECTOR OF PURCHASING
1 1
FIXED ASSET MANAGER HUMAN RESOURCES TECHNICIAN
1 3
INTERNAL AUDITOR INVENTORY CONTROL MANAGER
2 1
IT SPECIALIST LEGAL ASSISTANT
2 1
LEGAL SECRETARY PAYROLL OFFICER
1 1
REGIONAL STORE SUPERVISOR Retail Store Clerk I
7 629
Retail Store Clerk II Retail Store Clerk III
121 134
Retail Store Manager I Retail Store Manager II
14 13
Retail Store Manager III Retail Store Manager IV
30 39
SENIOR ACCOUNTANT Store Receiver
4 79
TRAINING AND DEVELOPMENT TRANSPORTATION SPECIALIST
4 1

17 If we add a second argument in the function table, it will generate a two-way
frequency table. We can generate relative frequencies (percentage of observations) as
follows: prop.table(table(dtEmployees$Title)).
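When a frequency table is this long, it can also help to look at the most
common titles first. The following line is an optional extension of the same
idea (not part of the original solution), sorting the counts in decreasing order:

> # five most frequent job titles
> head(sort(table(dtEmployees$Title), decreasing = TRUE), 5)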

5.7.3 Solution to Practice Problems - section 5.4.3


With the following script we are going to generate the graphs for stores
21 through 40. Notice that we specify that the store id has to be >= 21 and
<= 40. The resulting graphs can be accessed from the following link:
https://goo.gl/AN8VCY.
> g4_emplCountStore <-
ggplot(dt1_c[dt1_c$Store>=21&dt1_c$Store<=40,],
aes(x=BegDate)) +
geom_point(aes(y=emplCount))+
geom_line(aes(y=emplCount))+
xlab("Beginning Date")+ylab("Employee Count")
> g4_emplCountStore+facet_wrap(~Store, nrow = 5)
We repeat for stores 41 through 60. The resulting graphs can be accessed
from the following link: https://goo.gl/VnBxT9.
> g4_emplCountStore <-
ggplot(dt1_c[dt1_c$Store>=41&dt1_c$Store<=60,],
aes(x=BegDate)) +
geom_point(aes(y=emplCount))+
geom_line(aes(y=emplCount))+
xlab("Beginning Date")+ylab("Employee Count")
> g4_emplCountStore+facet_wrap(~Store, nrow = 5)


Finally, we have the graph for stores 61 through 79. The resulting graphs
can be accessed from the following link: https://goo.gl/531ruu.

> g4_emplCountStore <-


ggplot(dt1_c[dt1_c$Store>=61,], aes(x=BegDate)) +
geom_point(aes(y=emplCount))+
geom_line(aes(y=emplCount))+
xlab("Beginning Date")+ylab("Employee Count")
> g4_emplCountStore+facet_wrap(~Store, nrow = 5)

5.7.4 Solution to Practice Problems - section 5.4.3


Script for stores 21 through 40. The resulting graphs can be accessed from
the following link: https://goo.gl/bA4UMo.

> g4_totalPmtsStore <-


ggplot(dt1_c[dt1_c$Store>=21&dt1_c$Store<=40,], aes(x=BegDate)) +
geom_point(aes(y=totalPmts))+
geom_line(aes(y=totalPmts))+
xlab("Beginning Date")+ylab("Total Payments")
> g4_totalPmtsStore+facet_wrap(~Store, nrow = 5)

Script for stores 41 through 60. The resulting graphs can be accessed
from the following link: https://goo.gl/zanEjm.

> g4_totalPmtsStore <-


ggplot(dt1_c[dt1_c$Store>=41&dt1_c$Store<=60,], aes(x=BegDate)) +
geom_point(aes(y=totalPmts))+
geom_line(aes(y=totalPmts))+
xlab("Beginning Date")+ylab("Total Payments")
> g4_totalPmtsStore+facet_wrap(~Store, nrow = 5)

Script for stores 61 through 79. The resulting graphs can be accessed
from the following link: https://goo.gl/YpyThd.

> g4_totalPmtsStore <-


ggplot(dt1_c[dt1_c$Store>=61,], aes(x=BegDate)) +
geom_point(aes(y=totalPmts))+
geom_line(aes(y=totalPmts))+
xlab("Beginning Date")+ylab("Total Payments")
> g4_totalPmtsStore+facet_wrap(~Store, nrow = 5)


5.7.5 Solutions to Practice Problems - section 5.4.4


> g4_emplCountStoreSize <-
ggplot(dt1_d[dt1_d$storeSize=="Q2",], aes(x=BegDate)) +
geom_point(aes(y=emplCount))+
geom_line(aes(y=emplCount))+
xlab("Beginning Date")+ylab("Employee Count")
> g4_emplCountStoreSize+facet_wrap(~Store, nrow = 5)

The resulting graphs can be accessed from the following link:
https://goo.gl/mhwP7M.

> g4_emplCountStoreSize <-


ggplot(dt1_d[dt1_d$storeSize=="Q3",], aes(x=BegDate)) +
geom_point(aes(y=emplCount))+
geom_line(aes(y=emplCount))+
xlab("Beginning Date")+ylab("Employee Count")
> g4_emplCountStoreSize+facet_wrap(~Store, nrow = 5)

The resulting graphs can be accessed from the following link:
https://goo.gl/DE5nRW.

> g4_emplCountStoreSize <-


ggplot(dt1_d[dt1_d$storeSize=="Q4",], aes(x=BegDate)) +
geom_point(aes(y=emplCount))+
geom_line(aes(y=emplCount))+
xlab("Beginning Date")+ylab("Employee Count")
> g4_emplCountStoreSize+facet_wrap(~Store, nrow = 5)

The resulting graphs can be accessed from the following link:
https://goo.gl/B7MW8J.

> g4_emplCountStoreSize <-


ggplot(dt1_d[dt1_d$storeSize=="sqftOutlier",], aes(x=BegDate)) +
geom_point(aes(y=emplCount))+
geom_line(aes(y=emplCount))+
xlab("Beginning Date")+ylab("Employee Count")
> g4_emplCountStoreSize+facet_wrap(~Store, nrow = 5)

The resulting graphs can be accessed from the following link:
https://goo.gl/neBN4R.


5.7.6 Solutions to Practice Problems - section 5.4.4


> g4_totalPmtsStoreSize <-
ggplot(dt1_d[dt1_d$storeSize=="Q2",], aes(x=BegDate)) +
geom_point(aes(y=totalPmts))+
geom_line(aes(y=totalPmts))+
xlab("Beginning Date")+ylab("Total Payments")
> g4_totalPmtsStoreSize+facet_wrap(~Store, nrow = 5)

The resulting graphs can be accessed from the following link:
https://goo.gl/gnoYiz.

> g4_totalPmtsStoreSize <-


ggplot(dt1_d[dt1_d$storeSize=="Q3",], aes(x=BegDate)) +
geom_point(aes(y=totalPmts))+
geom_line(aes(y=totalPmts))+
xlab("Beginning Date")+ylab("Total Payments")
> g4_totalPmtsStoreSize+facet_wrap(~Store, nrow = 5)

The resulting graphs can be accessed from the following link:
https://goo.gl/TT9Lqm.

> g4_totalPmtsStoreSize <-


ggplot(dt1_d[dt1_d$storeSize=="Q4",], aes(x=BegDate)) +
geom_point(aes(y=totalPmts))+
geom_line(aes(y=totalPmts))+
xlab("Beginning Date")+ylab("Total Payments")
> g4_totalPmtsStoreSize+facet_wrap(~Store, nrow = 5)

The resulting graphs can be accessed from the following link:
https://goo.gl/QVGqqK.

> g4_totalPmtsStoreSize <-


ggplot(dt1_d[dt1_d$storeSize=="sqftOutlier",], aes(x=BegDate)) +
geom_point(aes(y=totalPmts))+
geom_line(aes(y=totalPmts))+
xlab("Beginning Date")+ylab("Total Payments")
> g4_totalPmtsStoreSize+facet_wrap(~Store, nrow = 5)

The resulting graphs can be accessed from the following link:
https://goo.gl/A8qwGF.



Part III

Appendix

Appendix A

Introduction to R

A.1 Introduction
R is a very powerful and versatile software package. In very general terms,
R can be used to extract, transform, and analyze structured (e.g., sales and
inventory data from a company’s enterprise system) and unstructured data
(e.g., Tweets, emails, blogs).
While the list of R applications is very extensive, here are some examples
of what you will learn in this introduction to ADA: run SQL queries in order
to extract and transform data; organize data (e.g., pivot tables); and run
simple statistical analyses, such as generating descriptive statistics, identifying
outliers, and performing hypothesis testing.
Chances are that you have heard that the learning curve of R is relatively
steep, and that some of the things that you can do with R you can also do
with other packages such as Excel or Tableau. You may wonder why bother
with R at all. There are at least a couple of reasons why it is worth your time
to make this investment.
• First, the spectrum of applications that you can work on is very large.
Figure A.1,1 shows the trade-off between the difficulty and complexity
of Excel and R. As you can see, the difficulty (learning curve) of Excel is
very low. The red Excel line grows very slowly along the horizontal axis.
The complexity of Excel (i.e., what you can do with Excel) becomes
almost vertical, which means there is a limit. Looking at the R line, we
can see that it becomes steep very quickly. This means that, initially,
it is more difficult to learn. However, the list of what you can do with
R keeps going much longer after Excel has reached its limit.
• Second, once you have learned R, it becomes relatively easier to work
  with any other package (e.g., SAS, STATA, SPSS, MATLAB, Octave).

Figure A.1: Excel vs R Trade Off

1 The figure was prepared by Gordon Shotwell and it is available from the following
URL: http://blog.yhat.com/posts/R-for-excel-users.html.

A.2 R Installation
R is pre-installed on some computers. Check the applications on your Mac
or All Programs on your Windows machine. If it is not pre-installed, you
can download and install it from the appropriate URL (shown below). For the
basic installation you can simply follow the directions on your screen and
accept the default settings.

• For Macs from: http://cran.r-project.org/bin/macosx/

• For Windows from: http://cran.r-project.org/bin/windows/base/

Install R on your machine before you move on to the next topic!


A.3 RStudio
While we can run our analysis from R, the interface of RStudio is more
user-friendly. RStudio is an integrated development environment (IDE) that
will let us see all components of our R project in an integrated interface.
Download RStudio from http://www.rstudio.com/. Please remember that
we install R first, and then RStudio.

Figure A.2: RStudio Interface

The interface of RStudio (shown in Figure A.2) is divided into four panels
(areas):2
1. The Source area is where we write script that we want to save. This is
helpful when we work on projects that we may want to revisit at a later
point. We can create a new R script file by selecting File > New File > R
Script. The new file will show as untitled1 in the source pane.
2. The Console plays a dual role. This is where we execute script inter-
actively, and we see the output/results of our script. Executing interactively
means that when we type an instruction in the console, and we hit return,
the line is immediately executed and the output will be shown in the console.
If we execute a line or an entire script in the source area, the output will be
shown in the console.
3. The Files, Plots, Packages, Help, Viewer area serves multiple needs.
For example, if our script contains instructions for the creation of a graph,
the output will show up in the Plots. If we need to install specific packages
that would allow us to execute some sophisticated functions, we do this from
the Packages area. We can view the files in our working directory or access
the built-in Help functionality.
4. The Environment area shows the data set(s) currently open and the
variables in each one of these data sets. The History, as the name indicates,
keeps track of every single line of script that we have entered through the
console.

2 If the interface looks different than the one shown in Figure A.2, we can change it
by selecting Preferences on a Mac or Tools > Global Options on Windows and then selecting
Pane Layout. If we want our interface to match Figure A.2, we can use the drop-down
menu in each pane and select Source for the upper left pane, Console for the upper right,
etc.

A.4 R Packages
R has a collection of ready-to-use programs, called packages. For example,
we can make some very powerful graphs with the package ggplot2, we can
generate detailed descriptive statistics with the package psych, run SQL
queries from within R using sqldf, and perform sentiment analysis based on
Twitter data using twitteR and stringr.
There are a couple of ways that we can install these packages. First,
we can use the console to type the command install.packages(). For
example, the following line would install the psych package:

install.packages('psych')

Alternatively, we can click Packages > Install and select the package from
the new window/drop down menu (see Figure A.3).
Once the packages have been installed, we can load them (i.e., include
them in our script) using the library function. For example, the following
line would load the psych package.

library(psych)

Please keep in mind that we install a package only once. However, we need
to load the appropriate package using the function library() every time we
need to leverage the functionality of the specific package in our script.
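Putting the two steps together, a session for the kind of work done in this
book might begin along the following lines (a sketch: install each package
once, then load it at the start of every session in which it is needed):

> install.packages('ggplot2') # one-time installation
> install.packages('sqldf')
> library(ggplot2) # load in every new session
> library(sqldf)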


Figure A.3: Install Packages in R

A.5 R Basics
A.5.1 Operations
In its simplest form, we can use R as a calculator to perform basic operations.
For example, in the console area we can write (See also Figure A.4):

> 50+20

It will return

[1] 70

> 80-30

It will return

[1] 50

> 20*3

It will return


[1] 60

> 54/9

It will return

[1] 6

We can calculate powers as follows:

> 2^3

[1] 8

or

> 2**3

[1] 8

Figure A.4: Console: Basic Operations

We can combine operations. We need to be careful to use parentheses in
the proper places.


> (2^3)+(80-50)

[1] 38

We can ask R to create a sequence of values. For example, if we want to get
the numbers 1 through 10, we can do this as follows:

> 1:10

[1] 1 2 3 4 5 6 7 8 9 10

A.5.2 Defining Variables


To define a new variable (object) in R, we use the less than and minus symbol
(<-). For example, with the following commands, we say that variable x is
equal to 5 and variable y is 9, and then we perform operations on the two
variables:

> x<- 5
> y<- 9
> x*y

[1] 45

> x-y

[1] -4

> x+y

[1] 14

We can combine existing variables to create new variables using the c() function.
For example, we can create a new variable that combines the values of the
variables x and y as follows:

> z <- c(x,y)


> z

As you can see the new variable z has two data points.

[1] 5 9

We continue adding more variables and more values.


> w <- c(7,z,7,x,7,y)


> w
The new variable w takes the following values:
[1] 7 5 9 7 5 7 9
NB When creating variables in R, we need to keep two things in mind:
1. R is case sensitive. This means that the lower case variable x is different
from the upper case variable X (see the example below).
2. When we specify a new variable in R, the name cannot start with
a number. If the name of the new variable that we want to create
starts with a number (e.g., 123x<- 5), R will return the following error
message: Error: unexpected symbol in "123x"
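For example, the following lines illustrate the first point; because x and X
are treated as two different objects, comparing them returns FALSE:

> x <- 5
> X <- 50
> x == X

[1] FALSE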

A.5.3 Cleaning Console and Environment


We can remove all content shown in the console area by pressing
CTRL+L.
Notice that in the problems that we have been working on, we keep using
x as the name of our variable. It is a good practice after we are done with
a problem, and before we move to the next one, to clear the data sets and
variables from the R environment. We do this to avoid the problem that
we may end up using the wrong variable specification. We can clean the R
environment by simply clicking the button showing the picture of the broom
(Figure A.5).

Figure A.5: RStudio - Clear Datasets
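If we prefer to work from the console instead of clicking the broom icon, the
same clean-up can be done with the rm() function; this is simply a script-based
alternative to the button described above:

> rm(x, y, z, w) # remove specific objects
> rm(list = ls()) # remove every object in the environment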

A.6 Working with R files


The console area is good for working interactively with R. However, some-
times, we may want to start working on a problem, save our work, and come
back and finish this later. In order to do this, we go into the source area and
select/create a new R script (See Figure A.6) or select File > New File >
R Script, or simply use the shortcut Ctrl+Shift+N. In the Source area, we
can start typing our script line-by-line.
When we hit Enter in the source area, it does not show any results. It
simply moves the cursor to the next line. If we want to execute one line at a
time, we can do this by clicking on the Run icon or using the shortcut Ctrl+Enter.
If we want to run all lines in our R script, we can do this by clicking on the
Source icon and selecting Source with Echo from the drop-down menu (Figure
A.7). Once we are done, we can select Save or Save As and save the file in
our working directory on our computer.

Figure A.6: Create R Script

When we create R files, it is a good idea to add comments that either


explain what we are trying to do or simply provide a reminder related to
the function that we are using. In R, we can add a comment by using the
number/pound # sign. Everything that follows the # is just a comment and
does not affect the functionality of our R script. We can add a comment
either at the beginning of a new line or after a command.
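For example, a short script written in the source area might look like the
following sketch, with one comment on its own line and one placed after a
command:

# define two variables and add them
x <- 5 # first value
y <- 9 # second value
x + y # the comment does not affect the result: [1] 14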


Figure A.7: Writing R Script in Source Area

A.7 Using Functions


R has built-in functions that automate tasks. For example, we have already
seen that 1:10 will generate a sequence of numbers from 1 to 10. An alterna-
tive approach would be to use one of the built-in functions in R. In this case,
we need to use the function seq(), which stands for sequence. Within the
function seq(), we need to specify the beginning (from), end (to), as well
as the increment (by). If we don't specify the increment, the default value is
1. For example, if we want numbers from 1.0 to 2.0 at increments of 0.2, we
write this as follows:

> seq(1, 2, 0.2)

[1] 1.0 1.2 1.4 1.6 1.8 2.0

If we want to get help on any R function, we type the question mark
(?) followed by the key word for the function. For example, a search for seq
would be as follows: ?seq.

A.7.1 Generating basic descriptive statistics


In the following example, we are going to create a new variable based on a
sequence of even numbers from 2 to 20, and use the length() function to find
the number of observations, sum() function to find the summation of these
numbers, min() for the minimum, max() for the maximum, and mean() to
generate the average of these values.


> x=seq(2,20,2)
> x

[1] 2 4 6 8 10 12 14 16 18 20

> length(x)

[1] 10

> sum(x)

[1] 110

> min(x)

[1] 2

> max(x)

[1] 20

> mean(x)

[1] 11
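Several of these statistics can also be obtained in a single call with the
summary() function used earlier in the book; for the same variable x it
returns the minimum, quartiles, median, mean, and maximum at once:

> summary(x)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    2.0     6.5    11.0    11.0    15.5    20.0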

Alphabetical Index

boxplot, 39
clear data sets, 118
foreign key, 5
ggplot
    aes(), 39
    color=, 51
    geom_boxplot(), 39
    geom_line, 48
    geom_point, 47
    ggplot(), 39
    outlier.color, 39
    qplot, 16
    scale_x_continuous, 48
    xlab for labeling x-axis, 16
    ylab for labeling y-axis, 16
histogram, 15
Inter Quartile Range, 38
IQR, 38
lower whisker, 38
primary key, 5
quartile, 14
R
    : sequence of values, 117
    <- define object, 117
    [,] all rows and all columns, 11
    [,] all rows and certain columns, 12
    [,] certain rows and all columns, 12
    [,] certain rows and certain columns, 12
    [] subsetting data, 10
    # comment sign, 119
    %in%, 50
    AND (&), 40
    as.numeric, 27
    Basics, 115
    c() combine or concatenate, 117
    comments, 119
    data.frame, 9
    data.table, 9
    date variable, 26
    Defining variables, 117
    format day, 26
    format month, 26
    format year, 26
    fread(), 8
    head(), 10
    ifelse, 40
    length(), 120
    max(), 120
    mean(), 120
    median(), 14
    min(), 120
    names(), 9
    OR (|), 40
    order ascending, 12
    order descending, 13
    order(), 12
    power, 116
    quantile, 14
    rename variable, 25
    seq(), 120
    str(), 9
    subsets of data, 10
    sum(), 120
    summary(), 15
    tail(), 10
R packages, 114
    ggplot2, 15, 114
    psych, 114
    sqldf, 20, 114
    stringr, 114
    twitteR, 114
RStudio
    File, 118
    clear console, 118
    clear environment, 118
    clear variables, 118
    Console, 113
    CTRL+L, 118
    Files, 114
    Help, 114
    help, 120
    install.packages, 114
    Packages, 114
    Plots, 114
    remove data sets, 118
    remove variables, 118
    Run, 118
    Script, 118
    Source, 113
    Source with Echo, 118
SQL, 19
    AS name, 23
    CASE ... END, 33
    categorical variables, 33
    DISTINCT, 22
    FROM, 21
    GROUP BY, 23
    HAVING, 24
    INNER JOIN, 30
    LEFT JOIN, 30
    merging tables, 29
    ORDER BY, 22
    ORDER BY DESC, 22
    SELECT, 21
    table.*, 21
    WHERE, 22
Structured Query Language, 19
summary statistics, 15
time series, 45
time series graph, 47
upper whisker, 38
